Stack Overflow Survey Exploratory Data Analysis (EDA) using Python
Introduction
In the realm of software development, Stack Overflow stands as a cornerstone platform, serving as a hub for developers worldwide to seek assistance, share knowledge, and engage with a vibrant community of peers. The annual Stack Overflow Developer Survey offers a plethora of insights into the demographics, preferences, and behaviors of developers across the globe.
This project delves into the Stack Overflow Developer Survey dataset, aiming to uncover valuable insights that illuminate the landscape of modern software development. By analyzing this dataset, the aim is to gain a deeper understanding of the developer community's composition, their technological preferences and their engagement with the platform.
- Data Extraction
- Data Prep and Cleaning
- Reusable Codes
  3.1. Year Range Conversion
  3.2. Colour Mapping
  3.3. Pie chart Function
  3.4. Grouped plots Function
- Demographic Analysis
  4.1. Age, gender and location of survey takers.
  4.2. Education and employment stats of survey takers.
- Education and Professional Background
  5.1. Educational background of survey takers and how it relates to their current roles.
  5.2. Whether survey takers are actively engaged in continuous learning, and their preferred professional development resources.
- Programming language and Technology
  6.1. Most commonly used programming languages and technologies among survey takers.
  6.2. Correlation between survey takers' years of experience and their choice of programming language.
- Stack Overflow Usage, Contribution and Community engagement
  7.1. Proportion of survey takers who have a Stack Overflow account.
  7.2. Frequency of survey takers visiting and participating in Q&A.
  7.3. Frequency of survey takers' contributions to open source projects and the nature of their contributions.
- Summary
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from wordcloud import WordCloud
Data Extraction
The Stack Overflow Developer Survey dataset provides a comprehensive snapshot of the global developer community, capturing a wide range of demographic, professional, and technological information. Here's a brief description of the dataset:
1. Demographic Information:
The dataset includes demographic information such as age, gender, country of residence, and employment status, providing insights into the diverse makeup of the developer community.
2. Professional Background:
Information about the respondents' professional backgrounds is included, such as their main branch (e.g., developer, student, manager), years of coding experience, education level, and industry affiliation.
3. Technological Preferences:
The dataset covers a wide array of technological preferences, including programming languages, databases, platforms, web frameworks, and developer tools used by the respondents in their day-to-day work.
4. Stack Overflow Engagement:
Data related to the respondents' engagement with the Stack Overflow platform is included, such as whether they have a Stack Overflow account, their frequency of visits, participation in Q&A, and their perception of the Stack Overflow community.
raw_df=pd.read_csv(r"C:\Users\swati\Desktop\DATA ANALYSIS\DataAnalyticswithpython\survey_results_public.csv")
raw_df_schema=pd.read_csv(r"C:\Users\swati\Desktop\DATA ANALYSIS\DataAnalyticswithpython\survey_results_schema.csv",index_col='qname')
raw_df_schema=raw_df_schema['question']
raw_df holds the full survey results. raw_df_schema holds the text of each survey question, indexed by question name.
Data Prep and Cleaning
The original dataset consists of 84 columns and 89184 rows of data. For this project only a subset of this data is required.
required_col_survey=[#DemographicAnalysis
'ResponseId',
'MainBranch',
'Country',
'Age',
'EdLevel',
'Employment',
'CodingActivities',
#EducationandProfessionalBackground
'YearsCode',
'YearsCodePro',
'LearnCode',
'LearnCodeOnline',
'LearnCodeCoursesCert',
'WorkExp',
#Employment Details
'DevType',
'OrgSize',
'CompTotal',
'Industry',
#Programming-language-and-Technology
'LanguageHaveWorkedWith',
'DatabaseHaveWorkedWith',
#Stackoverflow-Usage,-Contribution-and-Community-engagement
'NEWSOSites',
'SOVisitFreq',
'SOAccount',
'SOPartFreq',
'SOComm'
]
#subset of original raw survey data - 89184 rows × 24 columns
survey_df=raw_df[required_col_survey].copy()
required_col_schema=[#DemographicAnalysis
'ResponseId',
'MainBranch',
'Country',
'Age',
'EdLevel',
'Employment',
'CodingActivities',
#EducationandProfessionalBackground
'YearsCode',
'YearsCodePro',
'LearnCode',
'WorkExp',
#Employment Details
'DevType',
'OrgSize',
'CompTotal',
'ProfessionalTech',
'Industry',
#Programming-language-and-Technology
'Language',
'Database',
#Stackoverflow-Usage,-Contribution-and-Community-engagement
'NEWSOSites',
'SOVisitFreq',
'SOAccount',
'SOPartFreq',
'SOComm'
]
#subset of original schema
schema_df=raw_df_schema[required_col_schema].copy()
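Selecting columns with a list raises a KeyError if any name is absent, which can happen when column names change between survey years. The sketch below checks a list of required names before subsetting. The helper name `missing_columns` and the demo data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the names in `required` that are not columns of `df`."""
    return [col for col in required if col not in df.columns]

# Tiny stand-in for raw_df (made-up values, for illustration only)
demo_df = pd.DataFrame({
    'ResponseId': [1, 2],
    'Age': ['18-24 years old', '25-34 years old'],
})

missing = missing_columns(demo_df, ['ResponseId', 'Age', 'Country'])
print(missing)
```

If the returned list is empty, the subset selection can proceed safely.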
survey_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89184 entries, 0 to 89183
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   ResponseId              89184 non-null  int64
 1   MainBranch              89184 non-null  object
 2   Country                 87973 non-null  object
 3   Age                     89184 non-null  object
 4   EdLevel                 87973 non-null  object
 5   Employment              87898 non-null  object
 6   CodingActivities        73764 non-null  object
 7   YearsCode               87435 non-null  object
 8   YearsCodePro            66136 non-null  object
 9   LearnCode               87663 non-null  object
 10  LearnCodeOnline         70084 non-null  object
 11  LearnCodeCoursesCert    37076 non-null  object
 12  WorkExp                 43579 non-null  float64
 13  DevType                 76872 non-null  object
 14  OrgSize                 65043 non-null  object
 15  CompTotal               48225 non-null  float64
 16  Industry                36774 non-null  object
 17  LanguageHaveWorkedWith  87140 non-null  object
 18  DatabaseHaveWorkedWith  73435 non-null  object
 19  NEWSOSites              87973 non-null  object
 20  SOVisitFreq             87140 non-null  object
 21  SOAccount               87852 non-null  object
 22  SOPartFreq              66061 non-null  object
 23  SOComm                  87692 non-null  object
dtypes: float64(2), int64(1), object(21)
memory usage: 16.3+ MB
Several columns default to the object data type. To allow numeric calculations, the columns that hold numbers are converted to a numeric type.
In the case of "YearsCode", which records how many years someone has been coding, the answers contain both numeric and text values, as the list of unique values in this column shows. The two text values are "More than 50 years" and "Less than 1 year". For the current analysis, these values are replaced by NaN.
survey_df['YearsCode'].unique()
survey_df['YearsCode']=pd.to_numeric(survey_df.YearsCode,errors='coerce')
survey_df['YearsCode'].unique()
array([nan, 6., 14., 16., 20., 5., 10., 3., 30., 15., 7., 4., 28.,
8., 40., 11., 12., 9., 13., 25., 35., 27., 38., 45., 2., 23.,
17., 24., 26., 1., 48., 31., 18., 34., 50., 21., 22., 33., 19.,
39., 37., 29., 42., 43., 32., 44., 49., 46., 41., 47., 36.])
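Coercing the two text answers to NaN discards those respondents from any calculation on years of coding. An alternative, sketched below, maps the text answers to numeric stand-ins before converting. The stand-in values (0.5 and 51) and the helper name `years_to_numeric` are assumptions chosen for illustration, not part of the analysis above.

```python
import pandas as pd

# The two text answers reported for YearsCode.
# The numeric stand-ins are assumptions, not survey-defined values.
TEXT_TO_YEARS = {'Less than 1 year': 0.5, 'More than 50 years': 51}

def years_to_numeric(series, text_map=TEXT_TO_YEARS):
    """Map known text answers to numbers, then convert the column to float."""
    replaced = series.map(lambda value: text_map.get(value, value))
    return pd.to_numeric(replaced, errors='coerce')

demo = pd.Series(['6', 'Less than 1 year', 'More than 50 years', None, '14'])
converted = years_to_numeric(demo)
print(converted.tolist())
```

Whether such stand-ins are appropriate depends on the question being asked; for grouped ranges they are harmless, while for means they introduce a small, known bias.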
The same conversion is applied to "YearsCodePro".
survey_df['YearsCodePro'].unique()
survey_df['YearsCodePro']=pd.to_numeric(survey_df.YearsCodePro,errors='coerce')
survey_df['YearsCodePro'].unique()
survey_df.describe()
| | ResponseId | YearsCode | YearsCodePro | WorkExp | CompTotal |
|---|---|---|---|---|---|
| count | 89184.000000 | 86114.000000 | 64217.000000 | 43579.000000 | 4.822500e+04 |
| mean | 44592.500000 | 13.972049 | 11.508354 | 11.405126 | 1.036807e+42 |
| std | 25745.347541 | 10.259548 | 9.070346 | 9.051989 | 2.276847e+44 |
| min | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000e+00 |
| 25% | 22296.750000 | 6.000000 | 5.000000 | 5.000000 | 6.300000e+04 |
| 50% | 44592.500000 | 11.000000 | 9.000000 | 9.000000 | 1.150000e+05 |
| 75% | 66888.250000 | 20.000000 | 16.000000 | 16.000000 | 2.300000e+05 |
| max | 89184.000000 | 50.000000 | 50.000000 | 50.000000 | 5.000000e+46 |
One error that stands out is in the total compensation column, "CompTotal". The maximum value of 5e46 is clearly wrong, and it dominates the mean and standard deviation. To keep such entries out of the analysis, rows with a CompTotal above 1e10 (10 billion) are dropped.
survey_df.sort_values('CompTotal',ascending=False).CompTotal.head(5)
survey_df.drop(survey_df[survey_df['CompTotal']>1e10].index,inplace=True)
survey_df.sort_values('CompTotal',ascending=False).head(5).CompTotal
40971    1.000000e+10
40974    1.000000e+10
40975    1.000000e+10
40972    1.000000e+10
40973    1.000000e+10
Name: CompTotal, dtype: float64
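The same filter can be written as a boolean mask instead of an in-place drop. The sketch below uses made-up demo values to show that rows with a missing CompTotal are kept, because a comparison with NaN evaluates to False, and that a value exactly equal to the limit also survives.

```python
import numpy as np
import pandas as pd

UPPER_LIMIT = 1e10  # same threshold as in the analysis above

# Made-up demo rows: a normal value, a missing value, an absurd value,
# and a value exactly at the limit
demo = pd.DataFrame({
    'ResponseId': [1, 2, 3, 4],
    'CompTotal': [63000.0, np.nan, 5e46, 1e10],
})

# True for rows to keep: NaN > limit is False, so missing values are kept
keep = ~(demo['CompTotal'] > UPPER_LIMIT)
filtered = demo[keep]
print(filtered)
```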
survey_df.describe()
| | ResponseId | YearsCode | YearsCodePro | WorkExp | CompTotal |
|---|---|---|---|---|---|
| count | 89172.000000 | 86104.000000 | 64208.000000 | 43572.000000 | 4.821300e+04 |
| mean | 44593.487945 | 13.972405 | 11.509127 | 11.404962 | 1.156820e+07 |
| std | 25745.988811 | 10.259945 | 9.070699 | 9.052479 | 2.153491e+08 |
| min | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000e+00 |
| 25% | 22297.750000 | 6.000000 | 5.000000 | 5.000000 | 6.300000e+04 |
| 50% | 44594.500000 | 11.000000 | 9.000000 | 9.000000 | 1.150000e+05 |
| 75% | 66890.250000 | 20.000000 | 16.000000 | 16.000000 | 2.300000e+05 |
| max | 89184.000000 | 50.000000 | 50.000000 | 50.000000 | 1.000000e+10 |
Other numeric values are sensible and therefore no other rows are dropped.
schema_df['YearsCode']
'Including any education, how many years have you been coding in total?'
schema_df['YearsCodePro']
'NOT including education, how many years have you coded professionally (as a part of your work)?'
The survey takers may have entered or selected some contrasting answers - such as the years of professional coding (only professional coding) being greater than the total years of coding (which includes educational and professional coding). Such cases are dropped from the dataset.
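#Keep only rows where the total years of coding is at least the years of professional coding
#(rows with NaN in either column are also removed, because comparisons with NaN evaluate to False)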
survey_df=survey_df[survey_df['YearsCode']>=survey_df['YearsCodePro']]
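Because any comparison involving NaN is False, the filter above keeps only rows where both columns are present and consistent. The sketch below, on made-up values, contrasts that strict filter with a lenient variant that removes only rows where both values are present and contradictory. The lenient variant is an alternative shown for comparison, not what the analysis above does.

```python
import numpy as np
import pandas as pd

# Made-up demo rows: consistent, contradictory, missing total, missing professional
demo = pd.DataFrame({
    'YearsCode': [10, 5, np.nan, 8],
    'YearsCodePro': [7, 9, 3, np.nan],
})

# Strict: same form as the filter above; NaN rows fail the comparison and are dropped
strict = demo[demo['YearsCode'] >= demo['YearsCodePro']]

# Lenient: drop a row only when professional years exceed total years
contradictory = demo['YearsCodePro'] > demo['YearsCode']
lenient = demo[~contradictory]

print(len(strict), len(lenient))
```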
Next, it is also confirmed that the age range selected and years of coding are not contrasting, i.e., years of coding experience cannot be greater than the upper limit of the age group selected.
#There are also cases in the data set where the number of years of coding experience is greater than the upper limit of the age group
survey_df['Age'].unique()
survey_df_age1=survey_df[survey_df['Age']=='18-24 years old']
survey_df_age1=survey_df_age1[survey_df_age1['YearsCode']>=24]#values present
survey_df_age2=survey_df[survey_df['Age']=='25-34 years old']
survey_df_age2=survey_df_age2[survey_df_age2['YearsCode']>=34]
survey_df_age3=survey_df[survey_df['Age']=='35-44 years old']
survey_df_age3=survey_df_age3[survey_df_age3['YearsCode']>=44]
survey_df_age4=survey_df[survey_df['Age']=='55-64 years old']
survey_df_age4=survey_df_age4[survey_df_age4['YearsCode']>=64]
survey_df_age5=survey_df[survey_df['Age']=='Under 18 years old']
survey_df_age5=survey_df_age5[survey_df_age5['YearsCode']>=18]#values present
survey_df_age6=survey_df[survey_df['Age']=='45-54 years old']
survey_df_age6=survey_df_age6[survey_df_age6['YearsCode']>=54]
survey_df_age7=survey_df[survey_df['Age']=='65 years or older']
survey_df_age7=survey_df_age7[survey_df_age7['YearsCode']>=65]
#Ids where the years of experience is greater than age group upper limit
response_ids=list(survey_df_age1.ResponseId)+list(survey_df_age5.ResponseId)
survey_df= survey_df.drop(survey_df[survey_df['ResponseId'] == 33924].index)
survey_df= survey_df.drop(survey_df[survey_df['ResponseId'] == 81004].index)
survey_df= survey_df.drop(survey_df[survey_df['ResponseId'] == 65231].index)
survey_df= survey_df.drop(survey_df[survey_df['ResponseId'] == 86150].index)
survey_df= survey_df.drop(survey_df[survey_df['ResponseId'] == 71893].index)
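The per-group checks and the hard-coded ResponseId values could also be expressed with a lookup table of upper age limits. The sketch below is one possible way to do that; the dictionary `AGE_UPPER_LIMIT`, the helper `implausible_ids` and the demo rows are hypothetical. Groups without an upper limit ("65 years or older", "Prefer not to say") are left out of the table, so they are never flagged.

```python
import pandas as pd

# Upper limit of each bounded age group, matching the thresholds used above
AGE_UPPER_LIMIT = {
    'Under 18 years old': 18,
    '18-24 years old': 24,
    '25-34 years old': 34,
    '35-44 years old': 44,
    '45-54 years old': 54,
    '55-64 years old': 64,
}

def implausible_ids(df):
    """Return ResponseIds whose YearsCode is at or above their age group's upper limit."""
    limits = df['Age'].map(AGE_UPPER_LIMIT)  # NaN for groups not in the table
    mask = df['YearsCode'] >= limits         # comparisons with NaN are False
    return df.loc[mask, 'ResponseId'].tolist()

# Made-up demo rows
demo = pd.DataFrame({
    'ResponseId': [1, 2, 3, 4],
    'Age': ['18-24 years old', 'Under 18 years old', '25-34 years old', '65 years or older'],
    'YearsCode': [30, 5, 10, 50],
})

flagged = implausible_ids(demo)
cleaned = demo[~demo['ResponseId'].isin(flagged)]
print(flagged)
```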
No other discrepancies that may hinder the current analysis of the dataset are identified. A sample of the dataset for the current analysis is shown below.
survey_df.sample(10)
| | ResponseId | MainBranch | Country | Age | EdLevel | Employment | CodingActivities | YearsCode | YearsCodePro | LearnCode | ... | OrgSize | CompTotal | Industry | LanguageHaveWorkedWith | DatabaseHaveWorkedWith | NEWSOSites | SOVisitFreq | SOAccount | SOPartFreq | SOComm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19506 | 47151 | I am a developer by profession | India | 25-34 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Independent contractor, freelancer, or self-em... | Bootstrapping a business;Professional developm... | 10.0 | 7.0 | Books / Physical media;Online Courses or Certi... | ... | 500 to 999 employees | NaN | Information Services, IT, Software Development... | C;C#;Dart;HTML/CSS;JavaScript;PHP;Python;SQL;T... | Cloud Firestore;Dynamodb;Firebase Realtime Dat... | Stack Overflow;Stack Exchange;Stack Overflow f... | Multiple times per day | Yes | Less than once per month or monthly | Neutral |
| 79497 | 16728 | I am a developer by profession | Germany | 25-34 years old | Secondary school (e.g. American high school, G... | Employed, full-time | Hobby | 10.0 | 6.0 | Friend or family member;Other online resources... | ... | 100 to 499 employees | 55000.0 | Information Services, IT, Software Development... | TypeScript | NaN | Stack Overflow;Stack Exchange | A few times per week | Yes | I have never participated in Q&A on Stack Over... | No, not at all |
| 66672 | 81835 | I am a developer by profession | United States of America | 25-34 years old | Associate degree (A.A., A.S., etc.) | Employed, full-time | Hobby;Contribute to open-source projects;Boots... | 5.0 | 4.0 | On the job training;Other online resources (e.... | ... | 10,000 or more employees | 105000.0 | Information Services, IT, Software Development... | HTML/CSS;JavaScript;PHP;Python;SQL;TypeScript | Cloud Firestore;Elasticsearch;Firebase Realtim... | Stack Overflow | A few times per week | Not sure/can't remember | NaN | Yes, somewhat |
| 58120 | 24902 | I am not primarily a developer, but I write co... | United States of America | 35-44 years old | Professional degree (JD, MD, Ph.D, Ed.D, etc.) | Employed, full-time | Hobby;Contribute to open-source projects;Profe... | 10.0 | 8.0 | On the job training;Other online resources (e.... | ... | 100 to 499 employees | 160000.0 | NaN | Python;R | PostgreSQL | Stack Overflow | A few times per week | Yes | Less than once per month or monthly | Neutral |
| 86668 | 13878 | I am a developer by profession | Italy | 35-44 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | Hobby;Professional development or self-paced l... | 6.0 | 4.0 | Online Courses or Certification;On the job tra... | ... | 10,000 or more employees | 26500.0 | NaN | C#;HTML/CSS;SQL | Microsoft SQL Server | Stack Overflow | A few times per month or weekly | Yes | I have never participated in Q&A on Stack Over... | No, not at all |
| 37271 | 82770 | I am a developer by profession | United Kingdom of Great Britain and Northern I... | 35-44 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | Bootstrapping a business;Freelance/contract work | 20.0 | 15.0 | Books / Physical media;Online Courses or Certi... | ... | 1,000 to 4,999 employees | NaN | NaN | Bash/Shell (all shells);Go;Groovy;HTML/CSS;Jav... | Dynamodb;MariaDB;MySQL;Redis | Stack Overflow;Stack Exchange | A few times per week | Yes | Less than once per month or monthly | Yes, somewhat |
| 66994 | 832 | I am a developer by profession | France | 35-44 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Independent contractor, freelancer, or self-em... | Hobby | 30.0 | 25.0 | Books / Physical media;Online Courses or Certi... | ... | 100 to 499 employees | 100000.0 | Manufacturing, Transportation, or Supply Chain | C;C#;C++ | PostgreSQL;SQLite | Stack Overflow;Stack Exchange | Daily or almost daily | Yes | A few times per month or weekly | Yes, somewhat |
| 84477 | 17743 | I am a developer by profession | Portugal | 35-44 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | Professional development or self-paced learnin... | 13.0 | 10.0 | Books / Physical media;Online Courses or Certi... | ... | 100 to 499 employees | 38000.0 | Information Services, IT, Software Development... | Bash/Shell (all shells);C++;HTML/CSS;JavaScrip... | NaN | Stack Overflow;Stack Exchange | A few times per month or weekly | Not sure/can't remember | NaN | Yes, somewhat |
| 66418 | 60 | I am a developer by profession | Serbia | 25-34 years old | Master's degree (M.A., M.S., M.Eng., MBA, etc.) | Independent contractor, freelancer, or self-em... | Hobby;Contribute to open-source projects;Freel... | 8.0 | 4.0 | Online Courses or Certification;Other online r... | ... | 2 to 9 employees | 105000.0 | Legal Services | JavaScript;TypeScript | MongoDB | Stack Overflow;Stack Exchange;Collectives on S... | Multiple times per day | Yes | A few times per week | Yes, definitely |
| 79568 | 24726 | I am a developer by profession | United Kingdom of Great Britain and Northern I... | 35-44 years old | Some college/university study without earning ... | Independent contractor, freelancer, or self-em... | Hobby;Professional development or self-paced l... | 12.0 | 10.0 | Books / Physical media;Colleague;Online Course... | ... | 2 to 9 employees | 55000.0 | Information Services, IT, Software Development... | C#;JavaScript;SQL;TypeScript | Microsoft SQL Server | Stack Overflow | A few times per month or weekly | Yes | Less than once per month or monthly | Neutral |
10 rows × 24 columns
Reusable Codes
To keep the code simple, a few reusable functions are defined here and called throughout the analysis.
Year Range Conversion
There are multiple different values for Years of coding experience in the dataset. For the current analysis, these years are grouped into ranges to better analyse their influence. This is performed by the YearCode_range function.
#Converting Number of Years of experience into ranges
def YearCode_range(row):
    if row < 5:
        x='Less than 5'
    elif row >= 5 and row < 10:
        x='Between 5-10 years'
    elif row >= 10 and row < 20:
        x='Between 10-20 years'
    elif row >= 20 and row < 30:
        x='Between 20-30 years'
    elif row >= 30 and row < 40:
        x='Between 30-40 years'
    elif row >= 40 and row < 50:
        x='Between 40-50 years'
    else:
        x='50+ years'
    return x
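The same binning can be sketched with `pd.cut`, which works on the whole column at once. One behavioural difference is worth noting: in YearCode_range every comparison with NaN is False, so a missing value falls through to the else branch and is labelled '50+ years', whereas `pd.cut` leaves missing values as NaN. The demo values below are made up.

```python
import numpy as np
import pandas as pd

# Bin edges and labels mirroring YearCode_range; intervals are closed on the left
bins = [0, 5, 10, 20, 30, 40, 50, np.inf]
labels = ['Less than 5', 'Between 5-10 years', 'Between 10-20 years',
          'Between 20-30 years', 'Between 30-40 years', 'Between 40-50 years',
          '50+ years']

years = pd.Series([1, 5, 19.5, 20, 49, 50, np.nan])

# right=False makes each interval [low, high), matching the >= / < tests above
ranges = pd.cut(years, bins=bins, labels=labels, right=False)
print(ranges.tolist())
```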
Colour Mapping
The function color_matching pairs each text-based value with a colour and returns the pairs as a DataFrame.
#Color Mapping
def color_matching(item_list,color_set,Maingroup):
    All_colors = pd.DataFrame({Maingroup: item_list,'Color': color_set})
    color_map = {}
    for i, row in All_colors.iterrows():
        color_map[row[Maingroup]] = row['Color']
    All_colors['Color'] = All_colors[Maingroup].map(color_map)
    return All_colors
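For unique, non-missing items, the loop and the `.map` call in color_matching rebuild the Color column that the DataFrame constructor has already created. A shorter variant with an explicit length check could look like the sketch below. The name `build_color_table` is hypothetical, and the length check is an addition that gives a clearer error when the number of colours does not match the number of items.

```python
import pandas as pd

def build_color_table(item_list, color_set, main_group):
    """Pair each item with a colour and return the pairs as a DataFrame."""
    if len(item_list) != len(color_set):
        raise ValueError(
            f'{main_group}: {len(item_list)} items but {len(color_set)} colours'
        )
    return pd.DataFrame({main_group: list(item_list), 'Color': list(color_set)})

# Made-up demo values
table = build_color_table(['Yes', 'No'], ['green', 'red'], 'SOAccount')

# A plain dict is handy for direct lookups by item
lookup = dict(zip(table['SOAccount'], table['Color']))
print(lookup)
```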
#Main Branch
Branch_list=survey_df['MainBranch'].unique()
colors_branch=color_matching(Branch_list,['skyblue', 'orange', 'green'],'MainBranch')
#Age Group
Age_list=survey_df['Age'].unique()
colors_Age=color_matching(Age_list,['firebrick','bisque','gold','darkorchid','plum','royalblue','lightcoral','forestgreen'],'Age')
#Employment Status
Emp_list=survey_df['Employment'].copy()
Emp_list=Emp_list.str.split(';')
Emp_list=Emp_list.explode()
Emp_list=Emp_list.drop_duplicates().unique()
colors_emp=color_matching(Emp_list,['dodgerblue','maroon','khaki','palegreen','darkturquoise','dimgray'],'Employment')
#Education Level
ED_list=survey_df['EdLevel'].unique()
colors_ED=color_matching(ED_list,['m','pink','darkorchid','palevioletred','plum','darkblue','rebeccapurple','mediumvioletred'],'EdLevel')
#Industry
Industry_list=survey_df['Industry'].unique()
colors_Industry=color_matching(Industry_list,['darkred','lightsalmon','dimgray','darkorange','steelblue','orchid','lime','mediumslateblue','darkolivegreen','cyan','thistle','purple','b'],'Industry')
#Coding Activities
Coding_list=survey_df['CodingActivities'].copy()
Coding_list=Coding_list.str.split(';')
Coding_list=Coding_list.explode()
Coding_list=Coding_list.drop_duplicates().unique()
colors_Coding=color_matching(Coding_list,['crimson','salmon','navy','magenta','darkgoldenrod','lemonchiffon','sandybrown','dimgray','blueviolet'],'CodingActivities')
#Resources used to LearnCode
Resources_list=survey_df['LearnCode'].copy()
Resources_list=Resources_list.str.split(';')
Resources_list=Resources_list.explode()
Resources_list=Resources_list.drop_duplicates().unique()
colors_Resources=color_matching(Resources_list,['aquamarine','b','tomato','indigo','y','slategrey','deeppink','cornflowerblue','c','saddlebrown','dimgray'],'LearnCode')
#Developer Type
Dev_list=survey_df['DevType'].unique()
colors_Dev=color_matching(Dev_list,["turquoise", "orchid", "chartreuse", "skyblue", "limegreen", "darkorange",
"darkcyan", "thistle", "sandybrown", "dimgray", "darkkhaki", "peru",
"mediumaquamarine", "darkorchid", "lightcoral", "mediumvioletred", "darkseagreen",
"cadetblue", "lightpink", "dodgerblue", "mediumspringgreen", "darkgoldenrod",
"palegreen", "royalblue", "mediumslateblue", "saddlebrown", "steelblue",
"mediumturquoise", "darkolivegreen", "firebrick", "cornflowerblue","red","pink","orange"],'DevType')
#Programming Language Color
language_list=survey_df['LanguageHaveWorkedWith'].copy()
language_list=language_list.str.split(';')
language_list=language_list.explode()
language_list=language_list.drop_duplicates().unique()
All_Lang_colors=color_matching(language_list,['red', 'green', 'blue', 'yellow', 'orange', 'purple', 'cyan', 'magenta', 'lime',
'pink', 'teal', 'lavender', 'brown', 'beige', 'maroon', 'lightslategrey', 'olive', 'peru',
'navy', 'grey', 'white', 'black', 'turquoise', 'gold', 'skyblue', 'coral',
'violet', 'darkgreen', 'salmon', 'ivory', 'indigo', 'chartreuse', 'crimson',
'tan', 'khaki', 'silver', 'orchid', 'plum', 'azure', 'fuchsia', 'hotpink', 'dimgray',
'limegreen', 'bisque', 'chocolate', 'aqua', 'wheat', 'tomato', 'firebrick',
'mediumseagreen', 'mistyrose', 'rosybrown'],'LanguageHaveWorkedWith')
#Database Color
Database_list=survey_df['DatabaseHaveWorkedWith'].copy()
Database_list=Database_list.str.split(';')
Database_list=Database_list.explode()
Database_list=Database_list.drop_duplicates().unique()
All_Database_colors=color_matching(Database_list,["red", "blue", "green", "yellow", "orange", "purple", "cyan", "magenta",
"lime", "pink",'dimgray', "teal", "brown", "gold", "silver", "olive", "indigo",
"turquoise", "tan", "coral", "orchid", "maroon", "salmon", "navy", "lavender",
"crimson", "burlywood", "fuchsia", "violet", "beige", "skyblue", "aquamarine", "khaki"
],'DatabaseHaveWorkedWith')
#SOAccount
SOAccount_list=survey_df['SOAccount'].unique()
colors_SO_acc=color_matching(SOAccount_list,['red','green','yellow','dimgray'],'SOAccount')
#Most popular stackover flow sites ?
SOSites_list=survey_df['NEWSOSites'].copy()
SOSites_list=SOSites_list.str.split(';')
SOSites_list=SOSites_list.explode()
SOSites_list=SOSites_list.drop_duplicates().unique() #5 options
colors_SO_sites=color_matching(SOSites_list,['blue','purple','lavender','forestgreen','orange'],'NEWSOSites')
#Site Visit
SOVisit_list=survey_df['SOVisitFreq'].unique()
colors_SO_VisitFreq=color_matching(SOVisit_list,['blue', 'orange', 'green', 'red', 'purple', 'dimgray'],'SOVisitFreq')
#Site Participation
SOPart_list=survey_df['SOPartFreq'].unique()
colors_SO_PartFreq=color_matching(SOPart_list,['dimgray','skyblue','brown','peru','crimson','lime','pink'],'SOPartFreq')
#Years of coding experience
YearsExp_list=survey_df.copy()
YearsExp_list['YearsCode_Range']=survey_df['YearsCode'].apply(YearCode_range)
YearsExp_list=YearsExp_list['YearsCode_Range'].unique()
colors_YearsExp=color_matching(YearsExp_list,['Indigo', 'Magenta', 'Teal', 'Maroon', 'Olive', 'Coral', 'Cyan'],'YearsCode_Range')
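Several of the columns above (Employment, CodingActivities, LearnCode, LanguageHaveWorkedWith, DatabaseHaveWorkedWith, NEWSOSites) hold multiple answers separated by semicolons, and the same copy, split, explode and unique steps are repeated for each. A small helper could wrap that pattern, as sketched below. The name `unique_options` is hypothetical, and unlike the code above this sketch drops missing values first, so the colour lists would need one entry fewer wherever NaN was previously counted as an item.

```python
import pandas as pd

def unique_options(series, sep=';'):
    """Return the distinct individual answers in a multi-select column."""
    return series.dropna().str.split(sep).explode().unique()

# Made-up demo values
demo = pd.Series(['A;B', 'B;C', None, 'A'])
options = unique_options(demo)
print(list(options))
```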
Pie chart Function
The function draw_pie draws a pie chart and assigns each slice its colour from the colour table built by color_matching.
#Calculating percentage and drawing pie chart
def draw_pie(df,group1,figsize_x,figsize_y,fig_title,colors):
    df['Perc']=(df['ResponseId']/(df['ResponseId']).sum())*100
    fig,axes=plt.subplots(1,1,figsize=(figsize_x,figsize_y))
    df= df.sort_values('Perc',ascending=False)['Perc']
    color_set=df.reset_index().merge(colors,on=group1)['Color']
    df.plot(kind='pie', colors=color_set,fontsize=8,autopct='%1.1f%%').axes.get_yaxis().set_visible(False)
    plt.title(fig_title)
    print(df)
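draw_pie derives each group's share by dividing the per-group count of ResponseId by the total. When ResponseId has no missing values, the same percentages can be obtained directly with `value_counts(normalize=True)`, as in this sketch on made-up data. The result is already sorted in descending order.

```python
import pandas as pd

# Made-up demo rows
demo = pd.DataFrame({
    'ResponseId': [1, 2, 3, 4],
    'MainBranch': ['Dev', 'Dev', 'Dev', 'Student'],
})

# Share of each category as a percentage of all non-missing answers
perc = demo['MainBranch'].value_counts(normalize=True) * 100
print(perc)
```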
Grouped plots Function
For comparison of different parameters, it is easier to plot subplots. This is performed by the plot_grouped_bar_plots, plot_grouped_hist_plots and plot_grouped_pie_plots functions.
def plot_grouped_bar_plots(num_rows,num_cols,group1,group2,grouped_data,figsize_x,figsize_y,x_text,y_text,fig_title,colors,ranking=0):
    group2=grouped_data.get(group2)
    if num_rows==1:
        if ranking==1:
            fig, axes=plt.subplots(num_rows,num_cols,figsize=(figsize_x,figsize_y),sharex=False,sharey=False)
        if ranking==0:
            fig, axes=plt.subplots(num_rows,num_cols,figsize=(figsize_x,figsize_y),sharex=True,sharey=False)
        for i, (group2, data) in enumerate(grouped_data.items()):
            colors_x=colors
            col = i % num_cols
            if ranking==1:
                #colors_set=grouped_data.merge(All_Lang_colors, on=group1)['Color']
                grouped_data[group2 + 'Rank']=grouped_data[group2].rank(method='dense',ascending=False)
                colors_x=grouped_data[grouped_data[group2 + 'Rank']<=10].merge(colors, on=group1)['Color']
                data[grouped_data[group2 + 'Rank']<=10].plot.bar(ax=axes[col],color=colors_x)#list(color_x)
            if ranking==0:
                colors_x=grouped_data.merge(colors, on=group1)['Color']
                data.plot.bar(ax=axes[col], color=colors_x)
            axes[col].set_title(group2)
            axes[col].set_xlabel(x_text)
            axes[col].set_ylabel(y_text)
            axes[col].tick_params(axis='x')
            # Add annotations to the bars
            for bar in axes[col].patches:
                axes[col].annotate(str(bar.get_height()),
                                   (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                                   ha='center', va='center', xytext=(0, 5),
                                   textcoords='offset points')
        fig.suptitle(fig_title, fontsize=20)
        plt.tight_layout()
        plt.show()
    if num_rows>1:
        if ranking==1:
            fig, axes=plt.subplots(num_rows,num_cols,figsize=(figsize_x,figsize_y),sharex=False,sharey=False)
        if ranking==0:
            fig, axes=plt.subplots(num_rows,num_cols,figsize=(figsize_x,figsize_y),sharex=True,sharey=False)
        # Loop through each main branch and create a bar plot for each category
        for i, (group2, data) in enumerate(grouped_data.items()):
            colors_x=colors
            row = i // num_cols
            col = i % num_cols
            if ranking==1:
                grouped_data[group2 + 'Rank']=grouped_data[group2].rank(method='dense',ascending=False)
                colors_x=grouped_data[grouped_data[group2 + 'Rank']<=10].merge(colors, on=group1)['Color']
                data[grouped_data[group2 + 'Rank']<=10].plot.bar(ax=axes[row][col],color=colors_x)
            if ranking==0:
                colors_x=grouped_data.merge(colors, on=group1)['Color']
                data.plot.bar(ax=axes[row][col], color=colors_x)
            axes[row][col].set_title(group2)
            axes[row][col].set_xlabel(x_text)
            axes[row][col].set_ylabel(y_text)
            axes[row][col].tick_params(axis='x', rotation=90)
            # Add annotations to the bars
            for bar in axes[row][col].patches:
                axes[row][col].annotate(str(bar.get_height()),
                                        (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                                        ha='center', va='center', xytext=(0, 5),
                                        textcoords='offset points')
        fig.suptitle(fig_title, fontsize=20)
        plt.tight_layout()
        plt.show()
#Plotting multiple histograms
def plot_grouped_hist_plots(num_rows,num_cols,group1,group2,df,figsize_x,figsize_y,x_text,fig_title):
    grouped_data = df.groupby(group1)
    group1=df.get(group1)
    fig, axes = plt.subplots(num_rows,num_cols,figsize=(figsize_x,figsize_y))
    if num_rows==1:
        for i, (group1, data) in enumerate(grouped_data):
            col=i%num_cols
            counts = data[group2]
            # Plot histogram on the corresponding subplot
            axes[col].hist(counts, edgecolor='black')
            axes[col].set_title(f'{group1} ')
            axes[col].set_xlabel(x_text)
            axes[col].set_ylabel('Frequency')
    if num_rows>1:
        for i, (group1, data) in enumerate(grouped_data):
            col=i%num_cols
            row=i//num_cols
            # Count the number of unique languages used in the current industry
            counts = data[group2]
            # Plot histogram on the corresponding subplot
            axes[row][col].hist(counts, edgecolor='black')
            axes[row][col].set_title(f'{group1} ')
            axes[row][col].set_xlabel(x_text)
            axes[row][col].set_ylabel('Frequency')
    fig.suptitle(fig_title, fontsize=16)
    # Adjust layout
    plt.tight_layout()
    plt.show()
#Plotting multiple pie plots
def plot_grouped_pie_plots(num_rows,num_cols,group1,group2,grouped_data,figsize_x,figsize_y,fig_title,colors,ranking=0):
    group2=grouped_data.get(group2)
    if num_rows==1:
        fig, axes=plt.subplots(num_rows,num_cols,figsize=(figsize_x,figsize_y),sharex=False,sharey=False)
        for i, (group2, data) in enumerate(grouped_data.items()):
            colors_x=colors
            col = i % num_cols
            if ranking==1:
                grouped_data[group2 + 'Rank']=grouped_data[group2].rank(method='dense',ascending=False)
                colors_x=grouped_data[grouped_data[group2 + 'Rank']<=10].merge(colors, on=group1)['Color']
                data[grouped_data[group2 + 'Rank']<=10].plot.pie(ax=axes[col],colors=colors_x,autopct='%1.1f%%')
            if ranking==0:
                colors_x=grouped_data.merge(colors, on=group1)['Color']
                data.plot.pie(ax=axes[col], colors=colors_x,autopct='%1.1f%%')
            axes[col].set_title(group2)
            axes[col].axes.get_yaxis().set_visible(False)
            axes[col].axes.get_xaxis().set_visible(False)
        fig.suptitle(fig_title, fontsize=20)
        plt.tight_layout()
        plt.show()
    if num_rows>1:
        fig, axes=plt.subplots(num_rows,num_cols,figsize=(figsize_x,figsize_y),sharex=False,sharey=False)
        # Loop through each main branch and create a pie plot for each category
        for i, (group2, data) in enumerate(grouped_data.items()):
            colors_x=colors
            row = i // num_cols
            col = i % num_cols
            if ranking==1:
                grouped_data[group2 + 'Rank']=grouped_data[group2].rank(method='dense',ascending=False)
                colors_x=grouped_data[grouped_data[group2 + 'Rank']<=10].merge(colors, on=group1)['Color']
                data[grouped_data[group2 + 'Rank']<=10].plot.pie(ax=axes[row][col],colors=colors_x,autopct='%1.1f%%')
            if ranking==0:
                colors_x=grouped_data.merge(colors, on=group1)['Color']
                data.plot.pie(ax=axes[row][col], colors=colors_x,autopct='%1.1f%%')
            axes[row][col].set_title(group2)
            axes[row][col].axes.get_yaxis().set_visible(False)
            axes[row][col].axes.get_xaxis().set_visible(False)
        fig.suptitle(fig_title, fontsize=20)
        plt.tight_layout()
        plt.show()
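The grouped plot functions duplicate their bodies for the single-row and multi-row cases because `plt.subplots` returns a 1-D array of axes when there is only one row or one column. Passing `squeeze=False` makes it always return a 2-D array, so a single `axes[row][col]` code path would cover every grid shape, including a grid with several rows and one column. The sketch below shows the idea; the helper name `make_axes_grid` is hypothetical, and the Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, chosen here so the sketch runs headless
import matplotlib.pyplot as plt

def make_axes_grid(num_rows, num_cols, figsize=(8, 4)):
    """Create a figure whose axes are always a 2-D array of shape (num_rows, num_cols)."""
    fig, axes = plt.subplots(num_rows, num_cols, figsize=figsize, squeeze=False)
    return fig, axes

fig, axes = make_axes_grid(1, 3)
shape_single_row = axes.shape

# The same row/col indexing works for any grid shape
num_cols = axes.shape[1]
for i in range(axes.size):
    row, col = i // num_cols, i % num_cols
    axes[row][col].set_title(f'Panel {i}')

plt.close(fig)
print(shape_single_row)
```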
Demographic Analysis
Understanding the demographics of the survey respondents provides valuable insights into the composition of the developer community and allows us to gain a deeper understanding of their backgrounds and contexts. In this section, the various demographic factors are analysed, shedding light on the diverse makeup of the survey respondents.
Main Branch Analysis
The analysis starts with the main branch of the survey takers, which is the category respondents use to describe themselves: students, developers, people who write code but are not formally developers, former developers, etc. This analysis offers insights into the primary roles of the respondents.
Main_branch_group=survey_df.groupby('MainBranch').count()[['ResponseId']]
draw_pie(Main_branch_group,'MainBranch',8,8,'How would you describe yourself ?',colors_branch)
MainBranch
I am a developer by profession    89.723570
I am not primarily a developer, but I write code sometimes as part of my work/studies    8.073209
I used to be a developer by profession, but no longer am    2.203221
Name: Perc, dtype: float64
90% of the survey takers are developers by profession. A distant second are those who are not primarily developers but sometimes write code as part of work or studies. The rest are those that were previously developers.
Age Group Distribution
Next, the distribution of survey respondents across different age groups is analysed. This analysis provides insights into the generational diversity within the developer community.
#Calculating percentage and drawing pie chart
Age_group=survey_df.groupby('Age').count()[['ResponseId']]
draw_pie(Age_group,'Age',8,8,'Age Group',colors_Age)
Age
25-34 years old    43.833668
35-44 years old    29.100220
45-54 years old    11.735234
18-24 years old    9.394509
55-64 years old    4.586619
65 years or older    0.992556
Prefer not to say    0.195982
Under 18 years old    0.161211
Name: Perc, dtype: float64
Most survey takers are in the 25-34 age group (about 44%), which is consistent with most survey takers being developers by profession. This is followed by the 35-44 age group (about 29%), so most survey takers are of working age. Next are those in the early phase of their careers (18-24 years) or later in their careers (45-54 years). Those who are under 18 or 55 and older form a small percentage of the survey takers.
Top 10 Countries Analysis
Next, the top 10 countries from which the survey takers originate are identified. This analysis provides geographical insights into the distribution of developers globally and highlights the countries with the largest representation in the survey.
Country_group=survey_df.groupby('Country').count()[['ResponseId']]
Country_group['Perc_Country']=(Country_group['ResponseId']/(Country_group['ResponseId']).sum())*100
Country_group.sort_values('Perc_Country',ascending=False).head(10)
| Country | ResponseId | Perc_Country |
|---|---|---|
| United States of America | 14344 | 22.670734 |
| Germany | 5190 | 8.202810 |
| United Kingdom of Great Britain and Northern Ireland | 4318 | 6.824612 |
| India | 3336 | 5.272558 |
| Canada | 2614 | 4.131435 |
| France | 2257 | 3.567195 |
| Netherlands | 1774 | 2.803812 |
| Poland | 1728 | 2.731109 |
| Australia | 1567 | 2.476648 |
| Brazil | 1505 | 2.378657 |
About 23% of all survey takers are from the United States of America. This is followed by Germany at 8%, and then the United Kingdom of Great Britain and Northern Ireland and India at 7% and 5% respectively.
Country_group.sort_values('Perc_Country',ascending=False).head(10)['ResponseId'].plot(kind='bar', color="indigo")
plt.xlabel('Countries')
plt.ylabel('Count')
plt.title('Top 10 Countries in the survey takers demographic')
top_ten=Country_group.sort_values('Perc_Country',ascending=False).head(10)
top_ten[top_ten.Perc_Country.cumsum()<=50].Perc_Country
Country
United States of America    22.670734
Germany    8.202810
United Kingdom of Great Britain and Northern Ireland    6.824612
India    5.272558
Canada    4.131435
Name: Perc_Country, dtype: float64
Just under half of the survey takers (about 47%) are from one of these 5 countries: the USA, Germany, the United Kingdom, India or Canada. Adding the sixth country, France, takes the cumulative share past 50%.
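The cumulative filter can be reproduced by hand from the percentages in the top-10 table. The short sketch below copies the first six values from that table and shows where the running total crosses 50%; the variable names are illustrative only.

```python
import pandas as pd

# Percentages copied from the first six rows of the top-10 table above.
perc = pd.Series({
    "United States of America": 22.670734,
    "Germany": 8.202810,
    "United Kingdom of Great Britain and Northern Ireland": 6.824612,
    "India": 5.272558,
    "Canada": 4.131435,
    "France": 3.567195,
})

# Running total of the shares, largest country first.
running_total = perc.cumsum()

# Keep only the countries whose running total stays at or below 50%.
within_half = running_total[running_total <= 50]
print(within_half.round(2))
```

The last retained value is the combined share of the five countries, which is why the filter stops at Canada.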
Employment Group Distribution
Exploring the distribution of survey respondents based on their employment status allows us to understand the employment landscape within the developer community.
Employment_group=survey_df.groupby('Employment').count()[['ResponseId']]
Employment_group['Perc_Employment']=(Employment_group['ResponseId']/Employment_group['ResponseId'].sum())*100
Employment_group.index.unique()
Index(['Employed, full-time', 'Employed, full-time;Employed, part-time',
'Employed, full-time;Independent contractor, freelancer, or self-employed',
'Employed, full-time;Independent contractor, freelancer, or self-employed;Employed, part-time',
'Employed, full-time;Independent contractor, freelancer, or self-employed;Retired',
'Employed, full-time;Retired', 'Employed, part-time',
'Employed, part-time;Retired', 'I prefer not to say',
'Independent contractor, freelancer, or self-employed',
'Independent contractor, freelancer, or self-employed;Employed, part-time',
'Independent contractor, freelancer, or self-employed;Employed, part-time;Retired',
'Independent contractor, freelancer, or self-employed;Retired',
'Retired'],
dtype='object', name='Employment')
survey_df contains 63271 rows. The employment status question allowed multiple selections, so the column stores combinations of options separated by semicolons. To count each employment status on its own, the values are first split and exploded. The individual options are:
- Employed, full-time
- Employed, part-time
- Student, full-time
- Student, part-time
- Independent contractor, freelancer, or self-employed
- Not employed, but looking for work
- Not employed, and not looking for work
- Retired
- I prefer not to say
Because multiple options are possible, it is important to ensure the selections do not conflict. One such conflicting selection is Employed, full-time together with Retired. Responses with this selection are therefore dropped.
emp_analysis=survey_df.copy()
emp_analysis['Employment']=emp_analysis['Employment'].str.split(';')
emp_analysis=emp_analysis[['ResponseId','Employment']]
emp_analysis['Count']=emp_analysis['Employment'].str.len()
emp_analysis['EmpChoice1']=emp_analysis['Employment'].str[0]
emp_analysis['EmpChoice2']=emp_analysis['Employment'].str[1]
emp_analysis['EmpChoice3']=emp_analysis['Employment'].str[2]
emp_analysis=emp_analysis[emp_analysis['EmpChoice1']=='Employed, full-time']
emp_analysis=emp_analysis[emp_analysis['EmpChoice2']=='Retired']
data_set=survey_df.copy() #63271 rows × 24 columns
#keep only the responses whose ResponseId is not among the conflicting ones
data_set=data_set[~data_set['ResponseId'].isin(emp_analysis['ResponseId'])]
data_set #63260 rows × 24 columns
survey_df=data_set
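The cell above flags a response only when Employed, full-time is the first choice and Retired is the second. A possible variant, sketched here on a toy frame with illustrative rows, checks for the two options anywhere in the combination. The helper name `is_conflicting` is hypothetical, and whether three-option combinations should also be dropped is a judgment call rather than something the notebook states.

```python
import pandas as pd

# Toy stand-in for survey_df; rows are illustrative, not survey data.
toy_df = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "Employment": [
        "Employed, full-time",
        "Employed, full-time;Retired",
        "Employed, full-time;Independent contractor, freelancer, or self-employed;Retired",
        "Retired",
    ],
})

def is_conflicting(answer):
    """Return True when both full-time employment and retirement are selected."""
    if pd.isna(answer):
        return False
    choices = set(answer.split(";"))
    return {"Employed, full-time", "Retired"} <= choices

# Boolean mask over responses, independent of the position of each option.
conflict_mask = toy_df["Employment"].apply(is_conflicting)
cleaned = toy_df[~conflict_mask]
print(cleaned["ResponseId"].tolist())
```

Using a set comparison keeps the check independent of the order in which the options were stored.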
emp_analysis=survey_df.copy()
emp_analysis['Employment']=emp_analysis['Employment'].str.split(';')
emp_analysis=emp_analysis.explode('Employment')
emp_group=emp_analysis.groupby('Employment').count()[['ResponseId']]
draw_pie(emp_group,'Employment',8,8,'Employment Status',colors_emp)
Employment
Employed, full-time    80.070186
Independent contractor, freelancer, or self-employed    16.313789
Employed, part-time    2.940488
Retired    0.463518
I prefer not to say    0.212019
Name: Perc, dtype: float64
About 80% of the survey takers are employed full time. This is consistent with the finding that most survey takers are developers by profession. This is followed distantly by Independent contractor, freelancer, or self-employed at about 16%. The remaining statuses each form a very low percentage.
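One detail worth keeping in mind when reading these percentages: after `explode`, a respondent who selected two statuses contributes two rows. The sketch below, on a toy frame with illustrative values, contrasts the share of selections with the share of respondents. It assumes the pie chart percentages are computed from the grouped counts, which is what the `Perc` output suggests but is not shown in this section.

```python
import pandas as pd

# Toy stand-in for survey_df; rows are illustrative, not survey data.
toy_df = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "Employment": [
        "Employed, full-time",
        "Employed, full-time;Employed, part-time",
        "Retired",
    ],
})

# One row per selected option.
exploded = toy_df.assign(
    Employment=toy_df["Employment"].str.split(";")
).explode("Employment")

selection_counts = exploded.groupby("Employment")["ResponseId"].count()

# Denominator = number of selections (rows after explode).
share_of_selections = selection_counts / selection_counts.sum() * 100

# Denominator = number of distinct respondents.
share_of_respondents = selection_counts / toy_df["ResponseId"].nunique() * 100

print(share_of_selections.round(2))
print(share_of_respondents.round(2))
```

The two views answer slightly different questions, so it helps to state which denominator a chart uses.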
Education Level Distribution
The highest level of education of the survey takers is analysed next. This gives an understanding of the different educational backgrounds of the developers.
EDlevel_group=survey_df.groupby('EdLevel').count()[['ResponseId']]
draw_pie(EDlevel_group,'EdLevel',8,8,'Education Level',colors_ED)
EdLevel
Bachelor's degree (B.A., B.S., B.Eng., etc.)    46.130256
Master's degree (M.A., M.S., M.Eng., MBA, etc.)    27.967120
Some college/university study without earning a degree    11.430604
Professional degree (JD, MD, Ph.D, Ed.D, etc.)    5.196016
Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)    4.454632
Associate degree (A.A., A.S., etc.)    3.141005
Something else    1.160291
Primary/elementary school    0.520076
Name: Perc, dtype: float64
About 46% of the survey takers hold a Bachelor's degree (B.A., B.S., B.Eng., etc.) as their highest degree. This is followed by a Master's degree (M.A., M.S., M.Eng., MBA, etc.) at about 28%. About 5% report primary or secondary school as their highest level of education, and just a little over 5% have a Professional degree (JD, MD, Ph.D, Ed.D, etc.).
Industry Group Distribution
Finally, the distribution of survey respondents across different industries is analysed.
Industry_group=survey_df.groupby('Industry').count()[['ResponseId']]
draw_pie(Industry_group,'Industry',8,8,'Industry',colors_Industry)
Industry
Information Services, IT, Software Development, or other Technology    49.526109
Financial Services    12.166190
Other    10.851812
Manufacturing, Transportation, or Supply Chain    7.024917
Healthcare    6.077134
Retail and Consumer Services    5.376729
Higher Education    3.117549
Advertising Services    2.139962
Insurance    1.931330
Oil & Gas    0.724249
Legal Services    0.572246
Wholesale    0.491774
Name: Perc, dtype: float64
About 50% of the survey takers work in the IT sector (Information Services, IT, Software Development, or other Technology), followed by Financial Services at about 12%.
Education and Professional Background
Understanding the educational backgrounds and professional trajectories of survey respondents is crucial for gaining insights into their skills, expertise, and career paths. In this section, various aspects of the education and professional background of the survey respondents are analysed, shedding light on their qualifications, employment status, coding activities, and resources used.
ED_Emp_analysis=survey_df.copy()
ED_Emp_analysis['Employment']=ED_Emp_analysis['Employment'].str.split(';')
ED_Emp_analysis=ED_Emp_analysis.explode('Employment')
ED_Emp_analysis['CodingActivities']=ED_Emp_analysis['CodingActivities'].str.split(';')
ED_Emp_analysis=ED_Emp_analysis.explode('CodingActivities')
ED_Emp_analysis['LearnCode']=ED_Emp_analysis['LearnCode'].str.split(';')
ED_Emp_analysis=ED_Emp_analysis.explode('LearnCode')
ED_Emp_relation=ED_Emp_analysis[['ResponseId','MainBranch','Age','EdLevel','Employment','YearsCode','CodingActivities','LearnCode','WorkExp','Industry']]
ED_Emp_relation
| | ResponseId | MainBranch | Age | EdLevel | Employment | YearsCode | CodingActivities | LearnCode | WorkExp | Industry |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 11 | I am a developer by profession | 25-34 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | 14.0 | Hobby | Books / Physical media | 3.0 | Other |
| 2 | 11 | I am a developer by profession | 25-34 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | 14.0 | Hobby | Friend or family member | 3.0 | Other |
| 2 | 11 | I am a developer by profession | 25-34 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | 14.0 | Hobby | Other online resources (e.g., videos, blogs, f... | 3.0 | Other |
| 2 | 11 | I am a developer by profession | 25-34 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | 14.0 | Hobby | School (i.e., University, College, etc) | 3.0 | Other |
| 3 | 27 | I am a developer by profession | 25-34 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | 16.0 | Hobby | Books / Physical media | 9.0 | Information Services, IT, Software Development... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 89181 | 88011 | I am a developer by profession | 35-44 years old | Some college/university study without earning ... | Independent contractor, freelancer, or self-em... | 23.0 | Professional development or self-paced learnin... | Online Courses or Certification | NaN | NaN |
| 89181 | 88011 | I am a developer by profession | 35-44 years old | Some college/university study without earning ... | Independent contractor, freelancer, or self-em... | 23.0 | Professional development or self-paced learnin... | Other online resources (e.g., videos, blogs, f... | NaN | NaN |
| 89183 | 88950 | I am not primarily a developer, but I write co... | 65 years or older | Secondary school (e.g. American high school, G... | Retired | 46.0 | NaN | Books / Physical media | NaN | NaN |
| 89183 | 88950 | I am not primarily a developer, but I write co... | 65 years or older | Secondary school (e.g. American high school, G... | Retired | 46.0 | NaN | Colleague | NaN | NaN |
| 89183 | 88950 | I am not primarily a developer, but I write co... | 65 years or older | Secondary school (e.g. American high school, G... | Retired | 46.0 | NaN | On the job training | NaN | NaN |
464180 rows × 10 columns
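The row count grows well beyond the number of respondents because each `explode` multiplies a respondent's rows by the number of options they selected in that column. A minimal sketch on a single toy respondent (illustrative values) shows the effect and how `drop_duplicates` on the relevant columns brings the frame back to one row per respondent and option:

```python
import pandas as pd

# One toy respondent with 2 coding activities and 3 learning resources.
toy_df = pd.DataFrame({
    "ResponseId": [1],
    "CodingActivities": ["Hobby;Freelance/contract work"],
    "LearnCode": ["Books / Physical media;Colleague;On the job training"],
})

wide = toy_df.copy()
for col in ["CodingActivities", "LearnCode"]:
    # Split the semicolon-separated answers and give each option its own row.
    wide[col] = wide[col].str.split(";")
    wide = wide.explode(col)

# 2 activities x 3 resources = 6 rows for a single respondent.
print(len(wide))

# Restricting to the columns of interest and de-duplicating removes the inflation.
per_activity = wide[["ResponseId", "CodingActivities"]].drop_duplicates()
print(len(per_activity))
```

This is why counts taken directly on the exploded frame weight respondents with many selections more heavily.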
Main Branch and Education Level, Professional Background, Industry Comparison
The first comparison is between the main branch of survey takers and their highest level of formal education completed. This analysis provides insights into the educational qualifications of developers across different roles within the developer community.
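#count each respondent once; ED_Emp_relation holds one row per combination of exploded options
grouped_data = ED_Emp_relation[['ResponseId','EdLevel','MainBranch']].drop_duplicates().groupby(['EdLevel','MainBranch']).size().unstack()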
plot_grouped_bar_plots(1,3,'EdLevel','MainBranch',grouped_data,20,16,'EdLevel','Count','Comparison of Main Branch and Education Level \n',colors_ED)
Inference:
1. I am a developer by profession: Most developers in the workforce have a Bachelor's degree as their highest level of education. A Master's degree is the next most common, followed by some university study without a degree. This suggests that skills may matter more than formal education.
2. I am not primarily a developer, but I write code sometimes as part of my work/studies: Most people who selected this main branch have either a Bachelor's or a Master's degree. This group also has the relatively highest number of respondents with a Professional degree, which suggests that a higher education level is preferred for certain jobs that involve coding without being primarily development roles. The share of survey takers without a degree is lower here than among professional developers.
3. I used to be a developer by profession, but no longer am: As with those who are not primarily developers, survey takers who were previously developers have similar numbers of people with a Bachelor's and a Master's degree, again suggesting that a Bachelor's degree is sufficient to become a developer. The number of people without a degree is slightly higher than the number with a Professional degree, so the required education level appears to depend on the type of job.
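The tables behind these grouped plots are built with `groupby(...).size().unstack()`. As a small sketch on toy data (shortened, illustrative labels), the first grouping key becomes the row index and the second becomes the columns:

```python
import pandas as pd

# Toy data: one row per respondent; labels are shortened for readability.
toy_df = pd.DataFrame({
    "EdLevel": ["Bachelor's", "Bachelor's", "Master's", "Master's", "Master's"],
    "MainBranch": ["Developer", "Codes sometimes", "Developer", "Developer", "Codes sometimes"],
})

# size() counts rows per (EdLevel, MainBranch) pair;
# unstack() pivots MainBranch from the index into columns.
counts = toy_df.groupby(["EdLevel", "MainBranch"]).size().unstack()
print(counts)
```

Each column of the result can then be drawn as one panel, with the row index on the x-axis.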
Next the Main branch and the employment status are compared:
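#count each respondent once per employment status
grouped_data=ED_Emp_relation[['ResponseId','Employment','MainBranch']].drop_duplicates().groupby(['Employment','MainBranch']).size().unstack()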
plot_grouped_bar_plots(1,3,'Employment','MainBranch',grouped_data,20,13,'Employment','Count','Comparison of Main Branch and Employment Status \n',colors_emp)
Inference:
- Regardless of the main branch, most survey takers are employed full time.
- Survey takers who used to be developers, have a higher percentage of self-employed or retired employment status than the other branches.
Finally, the Main branch is compared to the type of industry:
#count each respondent once per industry
grouped_data = ED_Emp_relation[['ResponseId','Industry','MainBranch']].drop_duplicates().groupby(['Industry','MainBranch']).size().unstack()
plot_grouped_bar_plots(1,3,'Industry','MainBranch',grouped_data,20,13,'Industry','Count','Comparison of Industry and Main Branch \n',colors_Industry)
Inference:
- Most developers regardless of their Main Branch work in the IT sector.
- For current and previous developers, the Financial Services industry is also a popular choice, but only a distant second. For those who code as part of their work or studies, a relatively higher variation in the industry of employment is noticed.
- A higher percentage of those who work in higher education or Manufacturing are those who are not primarily developers but write code as part of their work.
Comparison of Education level and Professional status
Next, the education level is compared with the professional status. This comparison offers insights into the relationship between educational qualifications and employment within the developer community.
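#count each respondent once per employment status
df=ED_Emp_relation[['ResponseId','Employment','EdLevel']].drop_duplicates().groupby(['Employment','EdLevel']).size().unstack()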
plot_grouped_bar_plots(4,2,'Employment','EdLevel',df,16,16,'Employment Status','Count','Comparison of Employment Status and Education Level',colors_emp)
Inference:
- Regardless of the education level, most survey takers are either employed full time or self employed.
- In each education level, a higher percentage of people are self employed than employed part time.
- A higher percentage of people who are freelance/self-employed and employed part time are those who are either still in school or have been to university without earning a degree.
- People with a formal degree are more likely to be employed full time.
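The plots show counts, while several of the inferences are phrased as percentages within a group. One way to check such statements, sketched here on toy data with illustrative values, is a row-normalised cross-tabulation using `pd.crosstab` with `normalize="index"`:

```python
import pandas as pd

# Toy data: one row per respondent and employment status (illustrative values).
toy_df = pd.DataFrame({
    "EdLevel": ["Bachelor's"] * 4 + ["Secondary school"] * 2,
    "Employment": (
        ["Employed, full-time"] * 3
        + ["Employed, part-time"]
        + ["Employed, full-time", "Employed, part-time"]
    ),
})

# normalize="index" makes each row (education level) sum to 1.
row_share = pd.crosstab(
    toy_df["EdLevel"], toy_df["Employment"], normalize="index"
) * 100
print(row_share)
```

Reading across a row then gives the employment mix within one education level, independent of how many respondents that level has.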
Comparison of Coding Activity performed outside work and Education level and Professional status
Coding activities in the survey have the below options possible:
- Hobby
- Contribute to open-source projects
- Freelance/contract work
- Professional development or self-paced learning from online courses
- Other
- Bootstrapping a business
- I don't code outside of work
- School or academic work
Again as in the case of employment status, multiple options are possible. An example of a conflicting selection would be I don't code outside of work and any other Coding Activity option. Therefore, it is checked if any conflicting options are selected:
#Check whether responses with more than one coding activity also include the conflicting option "I don't code outside of work"
Coding_Act=survey_df.copy()
Coding_Act['CodingActivities']=Coding_Act['CodingActivities'].str.split(';')
Coding_Act['CodingActivities_options']=Coding_Act['CodingActivities'].str.len()
df=Coding_Act[['ResponseId','CodingActivities','CodingActivities_options']].sort_values('CodingActivities_options',ascending=False)
#keep only the responses with more than one selected option
df=df[df.CodingActivities_options>1]
ind_yes=[]
ind_no=[]
for x in df.index:
    #the match must use the exact option label found in the data
    if "I don't code outside of work" in df.loc[x].CodingActivities:
        ind_yes.append(df.loc[x].ResponseId)
    else:
        ind_no.append(df.loc[x].ResponseId)
count_yes=len(ind_yes)
count_no=len(ind_no)
print("Number of Conflicting Selections: "+ str(count_yes))
print("Number of Valid Selections: "+ str(count_no))
Number of Conflicting Selections: 0
Number of Valid Selections: 33580
The recorded output reports no conflicting selections, so the dataset is left unchanged. The check only works when it matches the exact option label, "I don't code outside of work"; a variant spelling such as "I don't code outside work" can never match and would report zero regardless of the data.
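The same conflict check can be expressed without an explicit loop. The sketch below runs on a toy frame with illustrative rows; the constant `NO_OUTSIDE_CODING` is a name introduced here for illustration, and its value is the option label as it appears in the list of coding activities above.

```python
import pandas as pd

# Option label as listed among the coding activities; the constant name is illustrative.
NO_OUTSIDE_CODING = "I don't code outside of work"

# Toy stand-in for survey_df; rows are illustrative, not survey data.
toy_df = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "CodingActivities": [
        "Hobby;Freelance/contract work",
        "I don't code outside of work",
        "Hobby;I don't code outside of work",
        None,
    ],
})

choices = toy_df["CodingActivities"].str.split(";")
n_choices = choices.str.len()

# True when the opt-out label is among the selected options (missing answers give False).
has_opt_out = choices.apply(lambda c: isinstance(c, list) and NO_OUTSIDE_CODING in c)

# A selection conflicts when the opt-out label is combined with any other option.
conflicting = toy_df[has_opt_out & (n_choices > 1)]
print(conflicting["ResponseId"].tolist())
```

Keeping the label in one constant avoids the check silently failing because of a small spelling difference.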
Coding Activities Distribution
Coding_Act=ED_Emp_relation[['ResponseId','CodingActivities']]
Coding_Act=Coding_Act.dropna(subset='CodingActivities')
Coding_Act=Coding_Act.drop_duplicates()
Coding_Act=Coding_Act.groupby(['CodingActivities']).count()[['ResponseId']]
draw_pie(Coding_Act,'CodingActivities',8,8,'Coding Activities',colors_Coding)
CodingActivities
Hobby    37.853395
Professional development or self-paced learning from online courses    19.714086
Contribute to open-source projects    13.314722
Freelance/contract work    10.169492
Bootstrapping a business    7.640515
I don't code outside of work    7.140943
School or academic work    3.325871
Other    0.840975
Name: Perc, dtype: float64
About 38% of the survey takers code outside of work as a Hobby, followed by Professional development and Contributing to open-source projects. Only 7% of survey takers don't code outside of work. Around 10% and 7.6% of survey takers engage in freelance work or bootstrapping a business respectively.
Comparison of Coding Activities performed and Main Branch
How do survey takers in different main branches code outside of work?
#Variation with Branch
Branch_coding=ED_Emp_relation[['ResponseId','CodingActivities','MainBranch']]
Branch_coding=Branch_coding.dropna(subset='CodingActivities')
Branch_coding=Branch_coding.dropna(subset='MainBranch')
Branch_coding=Branch_coding.drop_duplicates()
df=Branch_coding[['ResponseId','CodingActivities','MainBranch']]
df=df.groupby(['MainBranch','CodingActivities']).size().unstack()
plot_grouped_bar_plots(4,2,'MainBranch','CodingActivities',df,16,20,'MainBranch','Count','Comparison of Coding Activities outside work and Main Branch \n',colors_branch)
Inference:
- Regardless of the coding activities, most of these activities are performed by developers, given that a higher percentage of the survey takers are developers.
- The highest percentage of those who code as part of professional development or as part of school or academics are people who are not primarily developers but code as part of work or school.
Comparison of Coding Activities performed and Education Level
What education level would be required to perform certain coding activities?
#Variation with Education Level
ED_coding=ED_Emp_relation[['ResponseId','CodingActivities','EdLevel']]
ED_coding=ED_coding.dropna(subset='CodingActivities')
ED_coding=ED_coding.dropna(subset='EdLevel')
ED_coding=ED_coding.drop_duplicates()
df=ED_coding[['ResponseId','CodingActivities','EdLevel']]
df=df.groupby(['EdLevel','CodingActivities']).size().unstack()
plot_grouped_bar_plots(4,2,'EdLevel','CodingActivities',df,16,16,'EdLevel','Count','Comparison of Coding Activities outside work and Education Level \n',colors_ED)
Inference:
- A bachelor's degree is sufficient to perform most of these coding activities.
- Those who perform coding as part of school or academic work are those who are either enrolled or have a higher level of education such as a Professional degree.
Comparison of Coding Activities performed and Employment Status
How does employment status influence coding activities performed outside of work?
#Variation with employment status
Emp_coding=ED_Emp_relation[['ResponseId','CodingActivities','Employment']]
Emp_coding=Emp_coding.dropna(subset='CodingActivities')
Emp_coding=Emp_coding.dropna(subset='Employment')
Emp_coding['Employment']=Emp_coding['Employment'].str.split(';')
Emp_coding=Emp_coding.explode('Employment')
Emp_coding=Emp_coding.drop_duplicates()
df=Emp_coding[['ResponseId','CodingActivities','Employment']]
df=df.groupby(['Employment','CodingActivities']).size().unstack()
plot_grouped_bar_plots(4,2,'Employment','CodingActivities',df,16,16,'Employment','Count','Comparison of Coding Activities outside work and Employment Status \n',colors_emp)
Inference:
- Most of the people who perform any coding activities outside of work, regardless of the activity, are employed full time.
- A higher percentage of Independent contractors, freelancers and self-employed survey takers are involved in coding activities such as Bootstrapping a business, freelance work and open source projects.
- Most retired survey takers are still involved in coding as a hobby.
- Survey takers who are employed full time are involved in coding activities as part of Hobby, professional development or contributing to open source projects.
Comparison of Coding Activities performed and Age Group
How do coding activities performed outside of work change with certain age groups?
#Variation with age
Age_coding=ED_Emp_relation[['ResponseId','CodingActivities','Age']]
Age_coding=Age_coding.dropna(subset='CodingActivities')
Age_coding=Age_coding.dropna(subset='Age')
Age_coding=Age_coding.drop_duplicates()
df=Age_coding[['ResponseId','CodingActivities','Age']]
df=df.groupby([ 'CodingActivities','Age']).size().unstack()
plot_grouped_bar_plots(4,2,'CodingActivities','Age',df,16,16,'CodingActivities','Count','Comparison of Coding Activities outside work and Age Group \n',colors_Coding)
Inference: The analysis doesn't present any particularly surprising results, but confirms what is usually expected.
- Regardless of the age group, coding as part of a hobby is still dominant.
- Older people (65 or older) have a higher percentage of those who don't code outside work.
- Younger people (under 18 and 18-24 years old) are more likely to code as part of academics than any other age group.
- People in the work force (25-34 and 35-44 years old) are more likely to be bootstrapping a business, doing freelance work, or contributing to open source projects.
- Older people (65 or older) have a higher percentage of those who code as part of professional development or freelance work.
Comparison of Coding Activities performed and Years of coding experience
How do coding activities performed outside of work change with years of coding experience?
#Variation with number of years
Exp_coding=ED_Emp_relation[['ResponseId','CodingActivities','YearsCode']]
Exp_coding=Exp_coding.dropna(subset='CodingActivities')
Exp_coding=Exp_coding.dropna(subset='YearsCode')
Exp_coding=Exp_coding.drop_duplicates()
Exp_coding['YearsCode_Range']=Exp_coding['YearsCode'].apply(YearCode_range)
df=Exp_coding.groupby(['YearsCode_Range','CodingActivities']).size().unstack()
plot_grouped_bar_plots(4,2,'YearsCode_Range','CodingActivities',df,16,16,'YearsCode_Range','Count','Comparison of Coding Activities outside work and Years of coding Experience \n',colors_YearsExp)
Inference: As in the previous analysis, the variation of coding activities with years of coding experience doesn't present any particularly surprising results, but confirms what is usually expected.
- Regardless of the years of experience, coding as part of a hobby is still dominant.
- It would be expected that people with the most years of experience (50+ years) would be more involved in bootstrapping business, freelance or open source projects. However, most of them code as a hobby.
- People who have just entered the work force, with limited experience (under 5 years), are mostly coding as part of academic work or professional development.
- Open source projects, bootstrapping a business or freelance projects are mostly performed by those who have significant experience and are in the work force (Years of experience between 5-20 years).
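The experience ranges used above come from `YearCode_range`, a helper defined earlier in the notebook and not shown in this section. As a hedged illustration of the same idea, the sketch below bins years of experience with `pd.cut`. The bin edges and labels are hypothetical and may not match the notebook's own ranges.

```python
import pandas as pd

# Illustrative years of coding experience.
years = pd.Series([1, 4, 5, 12, 20, 35, 50])

# Hypothetical bin edges and labels; the notebook's own ranges may differ.
edges = [0, 5, 10, 20, 50, float("inf")]
labels = ["<5", "5-9", "10-19", "20-49", "50+"]

# right=False makes each bin include its lower edge and exclude its upper edge.
year_range = pd.cut(years, bins=edges, labels=labels, right=False)
print(year_range.tolist())
```

`pd.cut` returns an ordered categorical, so the ranges keep their natural order in grouped tables and plots.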
Professional Development
Are survey takers actively engaged in continuous learning? To determine this, responses where no learning resources are identified are counted and compared to those where at least one is. Only about 0.2% of the survey takers are not actively engaged in continuous learning.
#Are survey takers actively engaged in continuous learning and their preferred professional development resources.
No_learning=survey_df[survey_df['LearnCode'].isna()].count()['ResponseId']
learning=len(survey_df['ResponseId'])-No_learning
Continous_learning=pd.DataFrame(data=[['Learning',learning],['No Learning',No_learning]],columns=['Learning','Count'])
Continous_learning.Count.plot(kind="pie",labels=Continous_learning.Learning).axes.get_yaxis().set_visible(False)
Continous_learning
| | Learning | Count |
|---|---|---|
| 0 | Learning | 63114 |
| 1 | No Learning | 146 |
Most survey takers are engaged in continuous learning. What resources they prefer are analysed below.
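The share quoted for respondents with no learning resources can be checked directly from the two counts in the table above. This is plain arithmetic on those counts; the variable names are illustrative.

```python
# Counts copied from the Learning / No Learning table above.
counts = {"Learning": 63114, "No Learning": 146}

total = sum(counts.values())

# Percentage share of each group, rounded to two decimals.
shares = {label: round(n / total * 100, 2) for label, n in counts.items()}
print(total, shares)
```

The total equals the number of rows left in `survey_df` after the employment cleaning step.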
Resources= survey_df[['ResponseId','MainBranch','Country','Age','EdLevel','Employment','CodingActivities','YearsCode','LearnCode','Industry']]
Resources=Resources.dropna(subset=['LearnCode'])
Resources['LearnCode']=Resources['LearnCode'].str.split(';')
Resources['ResourceCount']=Resources['LearnCode'].str.len()
Resources=Resources.explode('LearnCode')
Resources['Employment']=Resources['Employment'].str.split(';')
Resources=Resources.explode('Employment')
Resources['CodingActivities']=Resources['CodingActivities'].str.split(';')
Resources=Resources.explode('CodingActivities')
Resources.drop_duplicates()
| | ResponseId | MainBranch | Country | Age | EdLevel | Employment | CodingActivities | YearsCode | LearnCode | Industry | ResourceCount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 11 | I am a developer by profession | United States of America | 25-34 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | Hobby | 14.0 | Books / Physical media | Other | 4 |
| 2 | 11 | I am a developer by profession | United States of America | 25-34 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | Hobby | 14.0 | Friend or family member | Other | 4 |
| 2 | 11 | I am a developer by profession | United States of America | 25-34 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | Hobby | 14.0 | Other online resources (e.g., videos, blogs, f... | Other | 4 |
| 2 | 11 | I am a developer by profession | United States of America | 25-34 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | Hobby | 14.0 | School (i.e., University, College, etc) | Other | 4 |
| 3 | 27 | I am a developer by profession | Belgium | 25-34 years old | Bachelor's degree (B.A., B.S., B.Eng., etc.) | Employed, full-time | Hobby | 16.0 | Books / Physical media | Information Services, IT, Software Development... | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 89181 | 88011 | I am a developer by profession | New Zealand | 35-44 years old | Some college/university study without earning ... | Independent contractor, freelancer, or self-em... | Bootstrapping a business | 23.0 | Other online resources (e.g., videos, blogs, f... | NaN | 3 |
| 89181 | 88011 | I am a developer by profession | New Zealand | 35-44 years old | Some college/university study without earning ... | Independent contractor, freelancer, or self-em... | Professional development or self-paced learnin... | 23.0 | Other online resources (e.g., videos, blogs, f... | NaN | 3 |
| 89183 | 88950 | I am not primarily a developer, but I write co... | Australia | 65 years or older | Secondary school (e.g. American high school, G... | Retired | NaN | 46.0 | Books / Physical media | NaN | 3 |
| 89183 | 88950 | I am not primarily a developer, but I write co... | Australia | 65 years or older | Secondary school (e.g. American high school, G... | Retired | NaN | 46.0 | Colleague | NaN | 3 |
| 89183 | 88950 | I am not primarily a developer, but I write co... | Australia | 65 years or older | Secondary school (e.g. American high school, G... | Retired | NaN | 46.0 | On the job training | NaN | 3 |
463876 rows × 11 columns
Resources_used=Resources[['ResponseId','LearnCode']]
Resources_used=Resources_used.dropna(subset='LearnCode')
Resources_used=Resources_used.drop_duplicates()
Resources_used=Resources_used.groupby(['LearnCode']).count()[['ResponseId']]
draw_pie(Resources_used,'LearnCode',8,8,'Resources used',colors_Resources)
LearnCode
Other online resources (e.g., videos, blogs, forum)    22.931451
Books / Physical media    15.965872
On the job training    15.563044
Online Courses or Certification    14.380167
School (i.e., University, College, etc)    14.146975
Colleague    7.723224
Friend or family member    2.889293
Coding Bootcamp    2.532190
Hackathons (virtual or in-person)    2.176915
Other    1.690870
Name: Perc, dtype: float64
Inference: No particular resource is a clear favourite among survey takers. A combination of resources is preferred by responders.
- The most commonly selected resource is online resources such as videos, blogs and forums, at about 23% of selections.
- This is followed by books or physical media, on the job training, online courses and school.
Average Resource use across different categories
How does the number of resources used by survey takers vary with age, education level, industry, coding activities and years of experience?
Numb_Resources= survey_df[['ResponseId','MainBranch','Country','Age','EdLevel','Employment','CodingActivities','YearsCode','LearnCode','Industry']]
Numb_Resources=Numb_Resources.dropna(subset=['LearnCode'])
Numb_Resources['LearnCode']=Numb_Resources['LearnCode'].str.split(';')
Numb_Resources['ResourceCount']=Numb_Resources['LearnCode'].str.len()
Numb_Resources_Age=Numb_Resources[['ResponseId','Age','ResourceCount']]
Numb_Resources_Age=Numb_Resources_Age.dropna(subset='Age')
Numb_Resources_Age=Numb_Resources_Age.dropna(subset='ResourceCount')
Numb_Resources_Age=Numb_Resources_Age.drop_duplicates()
plot_grouped_hist_plots(2,4,'Age','ResourceCount',Numb_Resources_Age,16,8,'Number of Resources',' How many Resources do different Age groups use? \n')
Numb_Resources_branch=Numb_Resources[['ResponseId','MainBranch','ResourceCount']]
Numb_Resources_branch=Numb_Resources_branch.dropna(subset='MainBranch')
Numb_Resources_branch=Numb_Resources_branch.dropna(subset='ResourceCount')
Numb_Resources_branch=Numb_Resources_branch.drop_duplicates()
plot_grouped_hist_plots(1,3,'MainBranch','ResourceCount',Numb_Resources_branch,16,5,'Number of Resources',' How many Resources do different Main Branches use? \n')
Numb_Resources_ED=Numb_Resources[['ResponseId','EdLevel','ResourceCount']]
Numb_Resources_ED=Numb_Resources_ED.dropna(subset='EdLevel')
Numb_Resources_ED=Numb_Resources_ED.dropna(subset='ResourceCount')
Numb_Resources_ED=Numb_Resources_ED.drop_duplicates()
plot_grouped_hist_plots(2,4,'EdLevel','ResourceCount',Numb_Resources_ED,16,8,'Number of Resources',' How many Resources do different Education Levels use? \n')
Numb_Resources_emp=Numb_Resources[['ResponseId','Employment','ResourceCount']]
Numb_Resources_emp['Employment']=Numb_Resources_emp['Employment'].str.split(';')
Numb_Resources_emp=Numb_Resources_emp.explode('Employment')
Numb_Resources_emp=Numb_Resources_emp[Numb_Resources_emp['Employment']!='I prefer not to say']
Numb_Resources_emp=Numb_Resources_emp.dropna(subset='Employment')
Numb_Resources_emp=Numb_Resources_emp.dropna(subset='ResourceCount')
Numb_Resources_emp=Numb_Resources_emp.drop_duplicates()
plot_grouped_hist_plots(1,4,'Employment','ResourceCount',Numb_Resources_emp,16,5,'Number of Resources',' How many Resources do different Employment Statuses use? \n')
Numb_Resources_Industry=Numb_Resources[['ResponseId','Industry','ResourceCount']]
Numb_Resources_Industry=Numb_Resources_Industry.dropna(subset='Industry')
Numb_Resources_Industry=Numb_Resources_Industry.dropna(subset='ResourceCount')
Numb_Resources_Industry=Numb_Resources_Industry.drop_duplicates()
plot_grouped_hist_plots(3,4,'Industry','ResourceCount',Numb_Resources_Industry,16,8,'Number of Resources',' How many Resources do different Industries use? \n')
Numb_Resources_Coding=Numb_Resources[['ResponseId','CodingActivities','ResourceCount']]
Numb_Resources_Coding['CodingActivities']=Numb_Resources_Coding['CodingActivities'].str.split(';')
Numb_Resources_Coding=Numb_Resources_Coding.explode('CodingActivities')
Numb_Resources_Coding=Numb_Resources_Coding.dropna(subset='CodingActivities')
Numb_Resources_Coding=Numb_Resources_Coding.dropna(subset='ResourceCount')
Numb_Resources_Coding=Numb_Resources_Coding.drop_duplicates()
plot_grouped_hist_plots(2,4,'CodingActivities','ResourceCount',Numb_Resources_Coding,16,5,'Number of Resources',' How many Resources do different Coding Activities require ? \n')
Inference:
- Most survey takers use at least 3 different resources on average for their development.
- Most age groups use more than 3 resources; the exception is the under-18 group, who rely on fewer than 3 resources for their continuous learning.
- The same pattern shows up across education levels: most degree holders average 3 or more resources, while Primary/Elementary school students rely on fewer than 3.
- Main branch, employment status, industry and coding activities have no significant impact on the number of resources; most survey takers use at least 3 resources.
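The averages quoted above are read off the histograms. As a rough cross-check, the same numbers could be computed directly with a group-wise summary. The sketch below is only an illustration: `toy` is an invented stand-in for `survey_df`, and `resource_summary` is a hypothetical name that does not appear elsewhere in this notebook.

```python
import pandas as pd

# Invented rows standing in for survey_df; not real survey data.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "Age": ["Under 18 years old", "Under 18 years old",
            "25-34 years old", "25-34 years old"],
    "LearnCode": ["Books;School", "Online courses",
                  "Books;Blogs;Colleague;On the job training", "Blogs;Books"],
})

# Same counting step as in the notebook: split on ';' and take the list length.
toy["ResourceCount"] = toy["LearnCode"].str.split(";").str.len()

# Mean and median number of resources per age group.
resource_summary = toy.groupby("Age")["ResourceCount"].agg(["mean", "median"])
print(resource_summary)
```

Swapping `"Age"` for another grouping column (for example `"EdLevel"`) would give the corresponding summary for that comparison.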
Comparison of Resources with Main Branch, Education Level, Coding Activity and Years of Experience
How do the resources used vary with main branch, education level, coding activity and years of coding experience?
#Variation with Main Branch
Branch_learning=Resources[['ResponseId','LearnCode','MainBranch']]
Branch_learning=Branch_learning.dropna(subset='LearnCode')
Branch_learning=Branch_learning.dropna(subset='MainBranch')
Branch_learning=Branch_learning.drop_duplicates()
df=Branch_learning.groupby(['LearnCode', 'MainBranch']).size().unstack()
#plotting_multiple(1,3,Branch_learning,MainBranch,16,20,'LearnCode','Count',' Comparison of Resources used and Main Branch \n')
plot_grouped_bar_plots(1,3,'LearnCode','MainBranch',df,20,10,'LearnCode','Count','Comparison of Resources used and Main Branch \n',colors_Resources)
Inference:
- Regardless of the main branch, online resources such as videos and blogs are a popular resource choice.
- The highest percentage of people who use books and other physical media are past developers.
- Developers and past developers are also more likely to learn through on-the-job training.
- People who are not primarily developers are more likely to use online certification courses as a resource.
#Variation with education level
ED_learning=Resources[['ResponseId','LearnCode','EdLevel']]
ED_learning=ED_learning.dropna(subset='LearnCode')
ED_learning=ED_learning.dropna(subset='EdLevel')
ED_learning=ED_learning.drop_duplicates()
df=ED_learning.groupby(['LearnCode', 'EdLevel']).size().unstack()
#plotting_multiple(2,4,ED_learning,EdLevel,25,21,'LearnCode','Count','Comparison of Resources used and Education Level \n')
plot_grouped_bar_plots(4,2,'LearnCode','EdLevel',df,20,20,'LearnCode','Count','Comparison of Resources used and Education Level \n',colors_Resources)
Inference:
- Primary/elementary and high school students are less likely to rely on their schools as a resource for learning to code. They are more likely to rely on family or friends.
- People with a formal degree (Bachelor's, Master's, Associate or professional degrees) rely on online videos, on-the-job training, online courses or books for continuous learning.
#Do people who partake in particular coding activities rely on certain resources ?
Emp_learning=Resources[['ResponseId','LearnCode','CodingActivities']]
Emp_learning=Emp_learning.dropna(subset='LearnCode')
Emp_learning=Emp_learning.dropna(subset='CodingActivities')
Emp_learning=Emp_learning.drop_duplicates()
df=Emp_learning.groupby(['LearnCode', 'CodingActivities']).size().unstack()
plot_grouped_bar_plots(2,4,'LearnCode','CodingActivities',df,16,16,'LearnCode','Count','Comparison of Resources used and Coding Activities \n',colors_Resources)
Inference:
- Online resources such as videos, blogs, online forums remain a popular choice regardless of the coding activity.
- People who don't code outside of work are more likely than any other group to rely on on-the-job training and colleagues for their professional development.
- People who code for academic work, open source projects, freelance work or bootstrapping a business are more likely to rely on online courses than those engaged in other coding activities.
- Coding bootcamps are a more common resource choice among those who perform coding for academic work or freelance work.
#Variation with experience
Exp_learning=Resources[['ResponseId','LearnCode','YearsCode']]
Exp_learning=Exp_learning.dropna(subset='LearnCode')
Exp_learning=Exp_learning.dropna(subset='YearsCode')
Exp_learning=Exp_learning.drop_duplicates()
Exp_learning
Exp_learning['YearsCode_Range']=Exp_learning['YearsCode'].apply(YearCode_range)
df=Exp_learning.groupby(['LearnCode', 'YearsCode_Range']).size().unstack()
num_cols=3
fig,axes=plt.subplots(3,3,figsize=(16,12),sharex=True,sharey=False)
for i, (YearsCode_Range, data) in enumerate(df.items()):
    colors_x=colors_Resources
    row = i // num_cols
    col = i % num_cols
    colors_x=df.merge(colors_Resources, on='LearnCode')['Color']
    data.plot.bar(ax=axes[row][col], color=colors_x)
    axes[row][col].set_title(YearsCode_Range)
    axes[row][col].set_xlabel('LearnCode')
    axes[row][col].set_ylabel('Count')
    axes[row][col].tick_params(axis='x')
    # Add annotations to the bars
    for bar in axes[row][col].patches:
        axes[row][col].annotate(str(bar.get_height()),(bar.get_x() + bar.get_width() / 2, bar.get_height()),
                                ha='center', va='center', xytext=(0, 5),textcoords='offset points')
axes[1][1].xaxis.set_tick_params(which='both', labelbottom=True, labeltop=False,rotation=90)
axes[1][2].xaxis.set_tick_params(which='both', labelbottom=True, labeltop=False,rotation=90)
axes[2][0].tick_params(axis='x',rotation=90)
fig.suptitle(' Comparison of Resources and Years of Coding Experience \n', fontsize=20)
fig.delaxes(axes[2][1])
fig.delaxes(axes[2][2])
plt.tight_layout()
plt.show()
Inference:
- People with more experience (more than 30 years), i.e. older age groups, prefer books or other physical media as a resource for professional development. Even within these groups, the preference for books increases with age. The same trend is seen in the preference for on-the-job training.
- Coding bootcamps are more popular among survey takers early in their careers (0-10 years of experience).
- The younger generation prefers online videos for continuous learning.
- The preference for online courses or certification decreases with age and years of experience.
Programming language and Technology
This section analyses the programming technologies preferred across different demographics.
programming_df=survey_df.copy()
programming_df=programming_df[['ResponseId','MainBranch','Country','Age','EdLevel','Employment','CodingActivities','YearsCode','DevType','Industry','LanguageHaveWorkedWith','DatabaseHaveWorkedWith']]
programming_df['Employment']=programming_df['Employment'].str.split(';')
programming_df['CodingActivities']=programming_df['CodingActivities'].str.split(';')
programming_df['Num_CodingActivities']=programming_df['CodingActivities'].str.len()
programming_df['LanguageHaveWorkedWith']=programming_df['LanguageHaveWorkedWith'].str.split(';')
programming_df['Num_LanguageHaveWorkedWith']=programming_df['LanguageHaveWorkedWith'].str.len()
programming_df['DatabaseHaveWorkedWith']=programming_df['DatabaseHaveWorkedWith'].str.split(';')
programming_df['Num_DatabaseHaveWorkedWith']=programming_df['DatabaseHaveWorkedWith'].str.len()
programming_df['YearsCode_Range']=programming_df['YearsCode'].apply(YearCode_range)
programming_df=programming_df.explode('Employment')
programming_df=programming_df.explode('CodingActivities')
programming_df=programming_df.explode('LanguageHaveWorkedWith')
programming_df=programming_df.explode('DatabaseHaveWorkedWith')
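Chaining several `explode` calls multiplies rows: a respondent with 2 employment values, 2 languages and 3 databases ends up with 2 × 2 × 3 = 12 rows. This is why the comparisons below select the relevant columns and call `drop_duplicates()` before counting. The following is a minimal sketch of that effect on a single invented respondent; `toy`, `exploded`, `raw_counts` and `deduped_counts` are illustrative names only.

```python
import pandas as pd

# One invented respondent with 2 employment values, 2 languages and 3 databases.
toy = pd.DataFrame({
    "ResponseId": [1],
    "Employment": [["Employed, full-time", "Independent contractor"]],
    "LanguageHaveWorkedWith": [["Python", "SQL"]],
    "DatabaseHaveWorkedWith": [["PostgreSQL", "SQLite", "Redis"]],
})

# Explode each list column in turn, as done for programming_df.
exploded = toy
for column in ["Employment", "LanguageHaveWorkedWith", "DatabaseHaveWorkedWith"]:
    exploded = exploded.explode(column)

rows_after_explode = len(exploded)  # 2 * 2 * 3 combinations

# Counting directly on the exploded frame counts every combination.
raw_counts = exploded.groupby("LanguageHaveWorkedWith")["ResponseId"].count()

# Dropping duplicate (respondent, language) pairs first counts each respondent once.
deduped = exploded[["ResponseId", "LanguageHaveWorkedWith"]].drop_duplicates()
deduped_counts = deduped.groupby("LanguageHaveWorkedWith")["ResponseId"].count()

print(rows_after_explode)
print(raw_counts)
print(deduped_counts)
```

Counts taken without de-duplication are weighted by how many options a respondent selected in the other exploded columns, so percentages derived from them should be read with that in mind.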
Programming Language Comparison
This section analyses how programming language preference varies with main branch, industry and developer type.
Programming Language Distribution
Below is a word cloud of all the programming languages selected by survey takers.
Languages_df=programming_df[['ResponseId','LanguageHaveWorkedWith']]
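# Assign the result: drop_duplicates() returns a new DataFrame and is otherwise a no-op
Languages_df=Languages_df.drop_duplicates()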
Languages_df=Languages_df.groupby('LanguageHaveWorkedWith').count()[['ResponseId']]
Languages_df['Perc_Language']=(Languages_df['ResponseId']/Languages_df['ResponseId'].sum())*100
#There are 51 unique programming languages selected by the survey takers. Because a pie chart would not make sense for such data a word cloud is generated
Languages_df=Languages_df.reset_index()
Languages_df
word_freq = dict(zip(Languages_df['LanguageHaveWorkedWith'], Languages_df['ResponseId']))
# Create a WordCloud object
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Languages_group=programming_df[['ResponseId','LanguageHaveWorkedWith']].groupby('LanguageHaveWorkedWith').count()[['ResponseId']]
Languages_group=Languages_group.sort_values('ResponseId',ascending=False)
draw_pie(Languages_group,'LanguageHaveWorkedWith',8,8,'Programming Language Popularity',All_Lang_colors)
| LanguageHaveWorkedWith | Perc |
|---|---|
| JavaScript | 11.202633 |
| SQL | 9.628019 |
| HTML/CSS | 9.175934 |
| Python | 7.936267 |
| TypeScript | 7.596819 |
| Bash/Shell (all shells) | 6.088770 |
| Java | 5.344882 |
| C# | 4.872780 |
| PHP | 4.452166 |
| Go | 3.248398 |
| C++ | 3.223404 |
| C | 2.933065 |
| PowerShell | 2.892007 |
| Rust | 2.156316 |
| Kotlin | 1.878602 |
| Ruby | 1.532823 |
| Dart | 1.341618 |
| Lua | 1.143277 |
| Visual Basic (.Net) | 1.005061 |
| Swift | 0.995693 |
| Groovy | 0.876213 |
| VBA | 0.872809 |
| Assembly | 0.847047 |
| R | 0.806098 |
| Perl | 0.703158 |
| Scala | 0.697706 |
| Delphi | 0.687496 |
| Objective-C | 0.669565 |
| MATLAB | 0.610648 |
| Elixir | 0.547706 |
| Solidity | 0.390827 |
| Lisp | 0.341534 |
| Haskell | 0.325103 |
| Erlang | 0.292278 |
| Clojure | 0.290229 |
| GDScript | 0.270285 |
| F# | 0.256306 |
| Fortran | 0.243315 |
| Prolog | 0.206721 |
| Julia | 0.202220 |
| Cobol | 0.185460 |
| Apex | 0.142022 |
| Zig | 0.141803 |
| Ada | 0.129910 |
| SAS | 0.128812 |
| Crystal | 0.125774 |
| OCaml | 0.111832 |
| Nim | 0.094669 |
| Flow | 0.064735 |
| APL | 0.045743 |
| Raku | 0.043437 |
The chart above shows the programming languages survey takers have worked with. No single language dominates: JavaScript, SQL, HTML/CSS, Python and TypeScript are the 5 most popular, and even the largest share (JavaScript) is only about 11%.
top_ten=Languages_group.sort_values('Perc',ascending=False).head(10)
top_ten=top_ten.reset_index()
top_ten[top_ten.Perc.cumsum()<=50][['LanguageHaveWorkedWith','Perc']]
| | LanguageHaveWorkedWith | Perc |
|---|---|---|
| 0 | JavaScript | 11.202633 |
| 1 | SQL | 9.628019 |
| 2 | HTML/CSS | 9.175934 |
| 3 | Python | 7.936267 |
| 4 | TypeScript | 7.596819 |
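The top 5 programming languages together account for about 45.5% of all selections made by survey takers; adding the sixth (Bash/Shell) would push the cumulative share past 50%.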
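The `cumsum()<=50` filter keeps the languages whose running total stays at or below 50%. The sketch below reproduces that step on the six largest shares copied from the output above, to show where the cut-off falls; `perc`, `cumulative` and `below_half` are illustrative names, not variables used elsewhere in the notebook.

```python
import pandas as pd

# Shares (in %) of the six most selected languages, copied from the output above.
perc = pd.Series({
    "JavaScript": 11.202633,
    "SQL": 9.628019,
    "HTML/CSS": 9.175934,
    "Python": 7.936267,
    "TypeScript": 7.596819,
    "Bash/Shell (all shells)": 6.088770,
})

# Running total of the shares, largest first.
cumulative = perc.cumsum()

# Keep only the languages whose running total is still at or below 50%.
below_half = perc[cumulative <= 50]

print(cumulative)
print(below_half.sum())
```

Five languages pass the filter, and their combined share is the last value of the running total that is still below 50%.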
Programming Language Variation with Main Branch
#Correlation between main branch and Language
Language_Branch=programming_df[['ResponseId','LanguageHaveWorkedWith','MainBranch']]
Language_Branch=Language_Branch.dropna(subset='LanguageHaveWorkedWith')
Language_Branch=Language_Branch.dropna(subset='MainBranch')
Language_Branch=Language_Branch.drop_duplicates()
df=Language_Branch.groupby(['LanguageHaveWorkedWith', 'MainBranch']).size().unstack()
plot_grouped_bar_plots(1,3,'LanguageHaveWorkedWith', 'MainBranch',df,20,10,'Programming Language','Count','Comparison of top 10 Programming Language and Main Branch \n',All_Lang_colors,1)
Inference:
- TypeScript is not a popular choice for those who are not primarily developers but write code as part of their studies or work. Python and SQL are the most popular programming languages in this group.
- Among current and past developers, JavaScript is the most popular programming language, followed by SQL. C# is also most preferred by current and past developers.
- The highest percentage of survey takers who use HTML/CSS are current developers.
- The highest percentage of people who use Bash/Shell are those who write code as part of their studies or work.
- PHP is most popular among past developers.
Programming Language Variation with Industry
#Correlation between Programming Language and Industry
Language_Industry=programming_df[['ResponseId','LanguageHaveWorkedWith','Industry']]
Language_Industry=Language_Industry.dropna(subset='LanguageHaveWorkedWith')
Language_Industry=Language_Industry.dropna(subset='Industry')
Language_Industry=Language_Industry.drop_duplicates()
df=Language_Industry.groupby(['LanguageHaveWorkedWith', 'Industry']).size().unstack()
plot_grouped_bar_plots(3,4,'LanguageHaveWorkedWith', 'Industry',df,25,21,'Programming Language','Count','Comparison of top 10 Programming Language used and Industry \n',All_Lang_colors,1)
Inference:
- JavaScript remains the most popular programming language across industries. The IT sector and Legal Services have the highest percentage of JavaScript users.
- Go appears among the preferred languages only in Advertising, Retail and Consumer, and Financial Services.
- The Oil & Gas and Advertising industries use a wide variety of programming languages.
- PowerShell is popular in Financial Services, Healthcare, Insurance, Legal Services, Oil & Gas and Wholesale.
- The highest percentage of HTML/CSS users come from the Higher Education industry.
- Python is a popular choice in the Oil & Gas and Higher Education industries.
- Rust is popular only in the Advertising industry.
Histogram of number of languages used in different Industries:
Numb_Lang_Industry=programming_df[['ResponseId','Industry','Num_LanguageHaveWorkedWith']]
Numb_Lang_Industry=Numb_Lang_Industry.dropna(subset='Industry')
Numb_Lang_Industry=Numb_Lang_Industry.dropna(subset='Num_LanguageHaveWorkedWith')
Numb_Lang_Industry=Numb_Lang_Industry.drop_duplicates()
plot_grouped_hist_plots(4,3,'Industry','Num_LanguageHaveWorkedWith',Numb_Lang_Industry,16,16,'Number of Languages',' How many Languages do different Industry use ? \n')
Inference:
- Most survey takers, regardless of industry, use a relatively small number of programming languages.
- Across industries, most survey takers use about 5 programming languages on average.
- Survey takers from the IT sector use 5-10 programming languages on average.
Programming Language Variation with Developer Type
Developer type indicates the fundamental developer role the survey taker most identifies with.
#Developers pie chart
Dev_group=survey_df[['ResponseId','DevType']]
Dev_group=Dev_group.dropna(subset='DevType')
Dev_group=Dev_group.groupby(['DevType']).count()[['ResponseId']]
Dev_group=Dev_group.sort_values('ResponseId',ascending=False)
draw_pie(Dev_group,'DevType',8,8,'Developer Type',colors_Dev)
| DevType | Perc |
|---|---|
| Developer, full-stack | 35.005340 |
| Developer, back-end | 19.241620 |
| Developer, front-end | 6.450533 |
| Developer, desktop or enterprise applications | 5.432028 |
| Developer, mobile | 3.489058 |
| Other | 3.336043 |
| Engineering manager | 3.061891 |
| Developer, embedded applications or devices | 2.527933 |
| Data scientist or machine learning specialist | 2.025853 |
| Senior Executive (C-Suite, VP, etc.) | 1.958909 |
| DevOps specialist | 1.906310 |
| Research & Development role | 1.840960 |
| Engineer, data | 1.716636 |
| Cloud infrastructure engineer | 1.467986 |
| Academic researcher | 1.370758 |
| Developer, game or graphics | 1.058353 |
| Data or business analyst | 0.937216 |
| System administrator | 0.773044 |
| Developer, QA or test | 0.753917 |
| Project manager | 0.717257 |
| Product manager | 0.610466 |
| Engineer, site reliability | 0.600902 |
| Security professional | 0.575400 |
| Developer Experience | 0.447887 |
| Scientist | 0.441512 |
| Educator | 0.428760 |
| Blockchain | 0.390507 |
| Hardware Engineer | 0.352253 |
| Database administrator | 0.328344 |
| Developer Advocate | 0.296466 |
| Designer | 0.290091 |
| Marketing or sales professional | 0.121137 |
| Student | 0.044629 |
35% of the survey takers are full-stack developers, followed by about 19% who are back-end developers, meaning that more than 50% of the survey takers are either full-stack or back-end developers.
For brevity, the comparison below covers only the 10 most popular developer types (excluding the option 'Other'):
- Developer, full-stack
- Developer, back-end
- Developer, front-end
- Developer, desktop or enterprise applications
- Developer, mobile
- Engineering manager
- Developer, embedded applications or devices
- Data scientist or machine learning specialist
- Senior Executive (C-Suite, VP, etc.)
- DevOps specialist
Language_Dev=programming_df[['ResponseId','LanguageHaveWorkedWith','DevType']]
Language_Dev=Language_Dev.dropna(subset='LanguageHaveWorkedWith')
Language_Dev=Language_Dev.dropna(subset='DevType')
Language_Dev=Language_Dev.drop_duplicates()
filtered_list=['Developer, full-stack','Developer, back-end','Developer, front-end','Developer, desktop or enterprise applications',
'Developer, mobile','Engineering manager', 'Developer, embedded applications or devices', 'Data scientist or machine learning specialist', 'Senior Executive (C-Suite, VP, etc.)','DevOps specialist']
Language_Dev['DevCheck']=Language_Dev['DevType'].isin(filtered_list)
Language_Dev=Language_Dev[Language_Dev['DevCheck']==True]
df=Language_Dev.groupby(['LanguageHaveWorkedWith', 'DevType']).size().unstack()
plot_grouped_bar_plots(2,5,'LanguageHaveWorkedWith', 'DevType',df,20,12,'Programming Language','Count',' Comparison of top 10 Programming Language and DevType\n',All_Lang_colors,1)
Inference:
- Python is the most popular choice among Data Scientists and DevOps Specialists.
- Full-stack developers, front-end developers, Engineering Managers and Senior Executives prefer JavaScript.
- Mobile developers are the only group in the top 10 that uses Kotlin, Dart, Swift and Objective-C. Most of them prefer Kotlin over the other languages.
- SQL is popular among desktop application and back-end developers.
- Front-end developers are more likely to use TypeScript than any other group in the top 10.
- The largest number of distinct programming languages is used by mobile developers.
- Embedded application developers prefer C and C++.
- Back-end developers are the most likely to use Java.
Numb_Lang_Dev=programming_df[['ResponseId','DevType','Num_LanguageHaveWorkedWith']]
Numb_Lang_Dev=Numb_Lang_Dev.dropna(subset='DevType')
Numb_Lang_Dev=Numb_Lang_Dev.dropna(subset='Num_LanguageHaveWorkedWith')
Numb_Lang_Dev=Numb_Lang_Dev.drop_duplicates()
plot_grouped_hist_plots(11,3,'DevType','Num_LanguageHaveWorkedWith',Numb_Lang_Dev,16,26,'Number of Languages',' How many Languages do different developers use ? \n')
Inference:
- Most developers use relatively few languages, averaging about 5 programming languages.
- Project managers, marketing professionals, hardware engineers, educators and site reliability engineers are among the developer types with high variation in the number of languages used.
Database Comparison
This section analyses how database preference varies with main branch, industry and developer type.
Database Distribution
Below is a word cloud with all the potential database options.
Database_df=programming_df[['ResponseId','DatabaseHaveWorkedWith']]
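# Assign the result: drop_duplicates() returns a new DataFrame and is otherwise a no-op
Database_df=Database_df.drop_duplicates()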
Database_df=Database_df.groupby('DatabaseHaveWorkedWith').count()[['ResponseId']]
Database_df['Perc_Database']=(Database_df['ResponseId']/Database_df['ResponseId'].sum())*100
#There are 32 unique databases selected by the survey takers. Because a pie chart would not make sense for such data a word cloud is generated
Database_df=Database_df.reset_index()
Database_df
word_freq = dict(zip(Database_df['DatabaseHaveWorkedWith'], Database_df['ResponseId']))
# Create a WordCloud object
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Database_group=programming_df[['ResponseId','DatabaseHaveWorkedWith']].groupby('DatabaseHaveWorkedWith').count()[['ResponseId']]
Database_group=Database_group.sort_values('ResponseId',ascending=False)
draw_pie(Database_group,'DatabaseHaveWorkedWith',8,8,'Database Popularity',All_Database_colors)
| DatabaseHaveWorkedWith | Perc |
|---|---|
| PostgreSQL | 14.464305 |
| MySQL | 12.490508 |
| SQLite | 10.464157 |
| Microsoft SQL Server | 8.207899 |
| MongoDB | 7.963445 |
| Redis | 7.812013 |
| MariaDB | 6.566617 |
| Elasticsearch | 5.448596 |
| Oracle | 3.257834 |
| Dynamodb | 3.241684 |
| Firebase Realtime Database | 2.296615 |
| Cloud Firestore | 2.130935 |
| Microsoft Access | 1.782568 |
| BigQuery | 1.637999 |
| Cosmos DB | 1.373441 |
| H2 | 1.358596 |
| InfluxDB | 1.182175 |
| Cassandra | 1.095307 |
| Supabase | 0.938802 |
| Snowflake | 0.854918 |
| Neo4J | 0.789086 |
| IBM DB2 | 0.785916 |
| Solr | 0.694124 |
| Couch DB | 0.636498 |
| Firebird | 0.544930 |
| Cockroachdb | 0.510765 |
| Clickhouse | 0.434937 |
| Couchbase | 0.384435 |
| DuckDB | 0.285408 |
| RavenDB | 0.144905 |
| Datomic | 0.123868 |
| TiDB | 0.096715 |
PostgreSQL, MySQL, SQLite, Microsoft SQL Server, MongoDB are the 5 most common database options among survey takers.
top_ten=Database_group.sort_values('Perc',ascending=False).head(10)
top_ten=top_ten.reset_index()
top_ten[top_ten.Perc.cumsum()<=50][['DatabaseHaveWorkedWith','Perc']]
| | DatabaseHaveWorkedWith | Perc |
|---|---|---|
| 0 | PostgreSQL | 14.464305 |
| 1 | MySQL | 12.490508 |
| 2 | SQLite | 10.464157 |
| 3 | Microsoft SQL Server | 8.207899 |
4 database options (PostgreSQL, MySQL, SQLite and Microsoft SQL Server) together account for about 45.6% of the survey takers' database selections; adding the fifth (MongoDB) would push the cumulative share past 50%.
#Correlation between main branch and database
Database_Branch=programming_df[['ResponseId','DatabaseHaveWorkedWith','MainBranch']]
Database_Branch=Database_Branch.dropna(subset='DatabaseHaveWorkedWith')
Database_Branch=Database_Branch.dropna(subset='MainBranch')
Database_Branch=Database_Branch.drop_duplicates()
df=Database_Branch.groupby(['DatabaseHaveWorkedWith', 'MainBranch']).size().unstack()
plot_grouped_bar_plots(1,3,'DatabaseHaveWorkedWith', 'MainBranch',df,16,10,'Database','Count', 'Comparison of top 10 Database and Main Branch \n',All_Database_colors,1)
Inference:
- PostgreSQL is most preferred by current and past developers.
- MySQL is preferred by those who write code sometimes, and is the second choice after PostgreSQL among past developers.
- SQLite users are most likely to be those who code as part of their studies or work.
- Current and past developers use the Dynamodb database.
#Correlation between Industry and database
Database_Industry=programming_df[['ResponseId','DatabaseHaveWorkedWith','Industry']]
Database_Industry=Database_Industry.dropna(subset='DatabaseHaveWorkedWith')
Database_Industry=Database_Industry.dropna(subset='Industry')
Database_Industry=Database_Industry.drop_duplicates()
df=Database_Industry.groupby(['DatabaseHaveWorkedWith', 'Industry']).size().unstack()
plot_grouped_bar_plots(3,4,'DatabaseHaveWorkedWith', 'Industry',df,25,21,'Database','Count','Comparison of Top 10 Database used and Industry \n',All_Database_colors,1)
Inference:
- PostgreSQL is preferred in most industries, except Advertising, Insurance and Oil & Gas.
- Survey takers in the Wholesale industry prefer both PostgreSQL and MySQL.
- The Advertising industry prefers MySQL over PostgreSQL.
- Microsoft SQL Server is a popular option in the Oil & Gas and Insurance industries.
- Only the Advertising and Wholesale industries use BigQuery.
- Only the Legal Services industry uses Cosmos DB.
Numb_DB_Industry=programming_df[['ResponseId','Industry','Num_DatabaseHaveWorkedWith']]
Numb_DB_Industry=Numb_DB_Industry.dropna(subset='Industry')
Numb_DB_Industry=Numb_DB_Industry.dropna(subset='Num_DatabaseHaveWorkedWith')
Numb_DB_Industry=Numb_DB_Industry.drop_duplicates()
plot_grouped_hist_plots(4,3,'Industry','Num_DatabaseHaveWorkedWith',Numb_DB_Industry,16,12,'Number of Databases',' How many Database have different industries used ? \n')
Inference:
- Most industries use relatively few databases, averaging under 5.
#Correlation between Dev type and Database
Database_Dev=programming_df[['ResponseId','DatabaseHaveWorkedWith','DevType']]
Database_Dev=Database_Dev.dropna(subset='DatabaseHaveWorkedWith')
Database_Dev=Database_Dev.dropna(subset='DevType')
Database_Dev=Database_Dev.drop_duplicates()
Database_Dev
filtered_list=['Developer, full-stack','Developer, back-end','Developer, front-end','Developer, desktop or enterprise applications',
'Developer, mobile','Engineering manager', 'Developer, embedded applications or devices', 'Data scientist or machine learning specialist', 'Senior Executive (C-Suite, VP, etc.)','DevOps specialist']
Database_Dev['DevCheck']=Database_Dev['DevType'].isin(filtered_list)
Database_Dev=Database_Dev[Database_Dev['DevCheck']==True]
df=Database_Dev.groupby(['DatabaseHaveWorkedWith', 'DevType']).size().unstack()
plot_grouped_bar_plots(2,5,'DatabaseHaveWorkedWith', 'DevType',df,20,12,'Database','Count','Comparison of top 10 Database used and Dev Type \n',All_Database_colors,1)
Inference:
- PostgreSQL is preferred by Data Scientists, DevOps Specialists, back-end developers, full-stack developers, Engineering Managers and Senior Executives.
- Survey takers who use Microsoft SQL Server are most likely to be Desktop or Enterprise application developers.
- Survey takers who use SQLite are most likely to be Embedded application or devices developers.
- MySQL is most preferred by frontend developers.
- Front End developers are the only ones likely to use Cloud Firestore database among the top 10 developer types.
- Desktop or Enterprise application developers are the only ones likely to use Firebird database among the top 10 developer types.
- Users of BigQuery are likely to be Data Scientists.
Numb_DB_Dev=programming_df[['ResponseId','DevType','Num_DatabaseHaveWorkedWith']]
Numb_DB_Dev=Numb_DB_Dev.dropna(subset='DevType')
Numb_DB_Dev=Numb_DB_Dev.dropna(subset='Num_DatabaseHaveWorkedWith')
Numb_DB_Dev=Numb_DB_Dev.drop_duplicates()
plot_grouped_hist_plots(11,3,'DevType','Num_DatabaseHaveWorkedWith',Numb_DB_Dev,16,26,'Number of Databases',' How many Database have different developers used ? \n')
Inference:
- Most developers use fewer than 5 databases on average.
- The number of databases used varies most among Security professionals, Engineering Managers, Research roles, Product managers and Database administrators.
Stack Overflow Usage, Contribution and Community Engagement
In this section, we delve into the characteristics of the developer community, exploring their engagement with Stack Overflow, their contribution to the platform, and their participation in the developer community at large.
SO_group=survey_df.copy()
SO_group['YearsCode_Range']=SO_group['YearsCode'].apply(YearCode_range)
Stack Overflow Account Status
#Starting with identifying how many survey takers have a stackoverflow account:
SO_acc=SO_group.groupby('SOAccount').count()[['ResponseId']]
draw_pie(SO_acc,'SOAccount',8,8,'Have a Stack Overflow Account ? ',colors_SO_acc)
| SOAccount | Perc |
|---|---|
| Yes | 77.998355 |
| No | 14.051378 |
| Not sure/can't remember | 7.950267 |
78% of the survey takers have a Stack Overflow Account.
Most popular Stack Overflow Sites
#Most popular Stack Overflow sites?
#colors_SO_sites=['blue','purple','lavender','forestgreen','orange']
SO_sites=SO_group[['ResponseId','NEWSOSites']]
SO_sites['NEWSOSites']=SO_sites['NEWSOSites'].str.split(';')
SO_sites=SO_sites.explode('NEWSOSites')
SO_sites=SO_sites.drop_duplicates()
SO_sites_group=SO_sites.groupby('NEWSOSites').count()[['ResponseId']]
draw_pie(SO_sites_group,'NEWSOSites',8,8,'Most used Stack Websites ? ',colors_SO_sites)
| NEWSOSites | Perc |
|---|---|
| Stack Overflow | 54.957619 |
| Stack Exchange | 37.758883 |
| Collectives on Stack Overflow | 4.012745 |
| Stack Overflow for Teams (private knowledge sharing & collaboration platform for companies) | 3.077113 |
| I have never visited Stack Overflow or the Stack Exchange network | 0.193640 |
55% of the survey takers use the Stack Overflow site the most, followed by 38% who use the Stack Exchange website. 0.2% of the survey takers have never visited either the Stack Overflow or the Stack Exchange network.
Stack Overflow Visit Frequency
#Site Visits
SO_visit=SO_group[['ResponseId','SOVisitFreq']]
SO_visit['SOVisitFreq'].dropna().unique() #5 options
SO_visit_group=SO_visit.groupby('SOVisitFreq').count()[['ResponseId']]
draw_pie(SO_visit_group,'SOVisitFreq',8,8,'How Often do you visit Stackoverflow ? ',colors_SO_VisitFreq)
| SOVisitFreq | Perc |
|---|---|
| A few times per week | 32.099060 |
| Daily or almost daily | 26.027767 |
| A few times per month or weekly | 22.971470 |
| Multiple times per day | 14.218770 |
| Less than once per month or monthly | 4.682933 |
Visit frequencies are spread fairly evenly across the main options, with just under 5% of the survey takers visiting Stack Overflow monthly or less often. About 72% visit at least a few times a week (32% a few times per week, 26% daily or almost daily and 14% multiple times per day).
Stack Overflow Participation Frequency
#Site Participation
SO_PartFreq=SO_group[['ResponseId','SOPartFreq']]
SO_PartFreq_group=SO_PartFreq.groupby('SOPartFreq').count()[['ResponseId']]
draw_pie(SO_PartFreq_group,'SOPartFreq',8,8,'How often do you participate in Stackoverflow Q&A ? ',colors_SO_PartFreq)
| SOPartFreq | Perc |
|---|---|
| Less than once per month or monthly | 54.963818 |
| I have never participated in Q&A on Stack Overflow | 22.396130 |
| A few times per month or weekly | 14.346695 |
| A few times per week | 5.160989 |
| Daily or almost daily | 2.044882 |
| Multiple times per day | 1.087487 |
Although most survey takers visit Stack Overflow at least a few times a week, about 55% participate in Stack Overflow Q&A monthly or less often, and a further 22% have never participated. Therefore, despite visiting Stack Overflow frequently, a large percentage of survey takers are not actively engaged in the community. Only 2% of the respondents participate almost daily, and just 1.1% participate multiple times a day.
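The two pie charts describe visits and participation separately. One way to compare them for the same respondents would be a row-normalised cross-tabulation of `SOVisitFreq` against `SOPartFreq`. The sketch below uses four invented rows rather than the survey data, and `visit_vs_part` is a hypothetical name; it is meant only to show the shape of such a check.

```python
import pandas as pd

# Four invented respondents; the category labels match the survey options.
toy = pd.DataFrame({
    "SOVisitFreq": [
        "Daily or almost daily",
        "Daily or almost daily",
        "A few times per week",
        "A few times per week",
    ],
    "SOPartFreq": [
        "Less than once per month or monthly",
        "A few times per week",
        "Less than once per month or monthly",
        "I have never participated in Q&A on Stack Overflow",
    ],
})

# Each row shows, for one visit frequency, the percentage split of participation frequencies.
visit_vs_part = pd.crosstab(
    toy["SOVisitFreq"], toy["SOPartFreq"], normalize="index"
) * 100

print(visit_vs_part.round(1))
```

Applied to `SO_group` after dropping missing values in both columns, each row would show how often frequent visitors actually take part in Q&A.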
Comparison of Stack Overflow Accounts
Comparison with Employment Status
#Employment vs whether they have stackoverflow accounts:
SO_acc_Emp=SO_group[['ResponseId','SOAccount','Employment']].dropna()
SO_acc_Emp['Employment']=SO_acc_Emp['Employment'].str.split(';')
SO_acc_Emp=SO_acc_Emp.explode('Employment')
SO_acc_Emp=SO_acc_Emp[SO_acc_Emp['Employment']!='I prefer not to say']
SO_acc_Emp=SO_acc_Emp.drop_duplicates()
df=SO_acc_Emp.groupby(['SOAccount','Employment']).size().unstack()
plot_grouped_pie_plots(2,2,'SOAccount','Employment',df,20,16,'Stack Overflow account - Employment',colors_SO_acc)
Inference:
- Retired survey takers are the least likely to have a Stack Overflow account.
- Freelancers and part-time employed survey takers are more likely to have a Stack Overflow account.
Comparison with Education Level
#EdLevel vs whether they have stackoverflow accounts:
SO_acc_ED=SO_group[['ResponseId','SOAccount','EdLevel']].dropna()
df=SO_acc_ED.groupby(['SOAccount','EdLevel']).size().unstack()
plot_grouped_pie_plots(4,2,'SOAccount','EdLevel',df,20,16,'Stack Overflow account - EdLevel',colors_SO_acc)
Inference:
- Education level does not show significant variation in whether survey takers have a Stack Overflow account.
- Those least likely to have a Stack Overflow account hold an Associate degree, a Professional degree or some other form of education.
Comparison with Developer Type
#Dev Type vs whether they have stackoverflow accounts:
SO_acc_Dev=SO_group[['ResponseId','SOAccount','DevType']].dropna()
df=SO_acc_Dev.groupby(['SOAccount','DevType']).size().unstack()
plot_grouped_pie_plots(11,3,'SOAccount','DevType',df,20,28,'Stack Overflow account - Dev Type \n',colors_SO_acc)
Inference:
- Students, System Administrators, Hardware Engineers, Security Professionals, Embedded applications developers and Database administrators are the least likely to have Stack Overflow accounts (No% >= 20%).
- Mobile developers, Engineering Managers, Developer Experience professionals and Blockchain developers are the most likely to have a Stack Overflow account (Yes% >= 80%).
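The 20% and 80% cut-offs used in these inferences can be applied programmatically rather than read off the pie charts. The following is a minimal sketch with invented counts; it assumes the table has `No` and `Yes` rows as in the grouped `df`, and the names `least_likely` and `most_likely` are hypothetical.

```python
import pandas as pd

# Illustrative counts only: rows are SOAccount answers, columns are developer types.
counts = pd.DataFrame(
    {
        "Student": [30, 70],
        "Developer, mobile": [15, 85],
        "Developer, back-end": [25, 75],
    },
    index=["No", "Yes"],
)

# Percentage of each answer within a developer type.
share = counts.div(counts.sum(axis=0), axis=1).mul(100)

# Developer types meeting each threshold.
least_likely = share.columns[share.loc["No"] >= 20].tolist()
most_likely = share.columns[share.loc["Yes"] >= 80].tolist()

print("No% >= 20:", least_likely)
print("Yes% >= 80:", most_likely)
```

If the real data contains further `SOAccount` answers, the percentages are still computed over all rows of the table, so the two lists need not cover every developer type.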
Comparison with Age Group
#Are older age groups using stackoverflow ?
SO_acc_Age=SO_group[['ResponseId','SOAccount','Age']].dropna()
df=SO_acc_Age.groupby(['SOAccount','Age']).size().unstack()
plot_grouped_pie_plots(2,4,'SOAccount','Age',df,16,8,'Stack Overflow account - Age \n',colors_SO_acc)
Inference:
- Survey takers in the workforce are more likely to have a Stack Overflow account.
- Older survey takers are less likely to have a Stack Overflow account.
- Students in school (under 18) are more likely to have an account than those in university (18-24 years).
Comparison of Stack Overflow Participation Frequency
The comparison shows that participation frequency does not depend on employment status, education level, age group or developer type.
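One way to put a number on how strongly participation frequency is associated with a demographic column is Cramér's V, which ranges from 0 (no association) to 1 (perfect association). This measure is not part of the original analysis; the sketch below is an assumption about how such a check could look, using invented count tables and a hypothetical helper named `cramers_v`.

```python
import numpy as np
import pandas as pd


def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for a contingency table of counts.

    Assumes every row and column total is non-zero.
    """
    observed = table.to_numpy(dtype=float)
    n = observed.sum()
    # Counts expected if the row and column variables were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Illustrative tables only: rows are groups, columns are participation answers.
no_association = pd.DataFrame([[10, 20], [30, 60]])
full_association = pd.DataFrame([[50, 0], [0, 50]])

v_none = cramers_v(no_association)
v_full = cramers_v(full_association)
print(v_none, v_full)
```

With the notebook's data, a table of this shape could be built with `pd.crosstab`, for example from the `Age` and `SOPartFreq` columns; values close to 0 would support the statement that participation frequency barely depends on the demographic.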
#Emp vs So part freq:
SO_PartFreq_Emp=SO_group[['ResponseId','Employment','SOPartFreq']].dropna()
SO_PartFreq_Emp['Employment']=SO_PartFreq_Emp['Employment'].str.split(';')
SO_PartFreq_Emp=SO_PartFreq_Emp.explode('Employment')
SO_PartFreq_Emp=SO_PartFreq_Emp[SO_PartFreq_Emp['Employment']!='I prefer not to say']
df=SO_PartFreq_Emp.groupby(['Employment','SOPartFreq']).size().unstack()
plot_grouped_pie_plots(3,2,'Employment','SOPartFreq',df,26,25,'Stack Overflow Participation Frequency - Employment \n',colors_emp)
Inference:
- Independent contractors, freelancers and self-employed survey takers are more likely to participate in Stack Overflow Q&A multiple times a day than to never participate.
- Survey takers who are employed full-time are the least likely to participate in the Stack Overflow community.
#EDLEVEL vs So part freq:
SO_PartFreq_ED=SO_group[['ResponseId','EdLevel','SOPartFreq']].dropna()
df=SO_PartFreq_ED.groupby(['EdLevel','SOPartFreq']).size().unstack()
plot_grouped_pie_plots(3,2,'EdLevel','SOPartFreq',df,26,25,'Stack Overflow Participation -Education Level \n',colors_ED)
Inference:
- Survey takers with a Bachelor's degree are more likely not to participate in the Stack Overflow community.
- Survey takers with a Master's degree are more likely to interact multiple times a day.
- Survey takers with a Professional degree are more likely to interact daily or multiple times a day.
- School students and survey takers without a formal degree are more likely to have never participated in the Stack Overflow community.
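Statements such as "more likely" are easiest to check against a baseline. The following is a minimal sketch, with invented counts, that compares each group's share of an answer with the share across all respondents (often called lift). It is one possible reading of "more likely", not necessarily the one used for the inferences above, and the names `within`, `overall` and `lift` are hypothetical.

```python
import pandas as pd

# Illustrative counts only: rows are education levels, columns are participation answers.
counts = pd.DataFrame(
    {
        "Never": [40, 10],
        "Monthly": [50, 30],
        "Multiple times per day": [10, 10],
    },
    index=["Bachelor's degree", "Master's degree"],
)

# Share of each answer within an education level (each row sums to 1).
within = counts.div(counts.sum(axis=1), axis=0)

# Share of each answer across everyone in the table.
overall = counts.sum(axis=0) / counts.to_numpy().sum()

# Lift > 1: the answer is over-represented in that group relative to everyone.
lift = within.div(overall, axis=1)
print(lift.round(2))
```

A lift well above 1 for a group and answer supports a "more likely" statement, while values near 1 indicate the group behaves like the overall population.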
#Age vs So part freq:
SO_PartFreq_Age=SO_group[['ResponseId','Age','SOPartFreq']].dropna()
df=SO_PartFreq_Age.groupby(['Age','SOPartFreq']).size().unstack()
plot_grouped_pie_plots(3,2,'Age','SOPartFreq',df,26,25,'Stack Overflow Participation - Age Group \n',colors_Age)
Inference:
- Survey takers in the 25-34 and 18-24 age groups are the most likely to have never participated in Stack Overflow Q&A.
- Survey takers in the 35-44 age group are more likely to interact a few times per week.
- Older survey takers, in the 45-54 and 55-64 age groups, are more likely to interact multiple times a day.
#dev vs So part freq:
#colors_SO_PartFreq=['skyblue','brown','peru','crimson','lime','pink']
SO_PartFreq_Dev=SO_group[['ResponseId','DevType','SOPartFreq']].dropna()
df=SO_PartFreq_Dev.groupby(['DevType','SOPartFreq']).size().unstack()
plot_grouped_pie_plots(3,2,'DevType','SOPartFreq',df,20,25,'Stack Overflow Participation - Dev Type \n',colors_Dev,1)
Inference:
- Full-stack developers are more likely to interact with the Stack Overflow community daily or almost daily.
- Back-end developers are more likely to have never participated in the Stack Overflow community.
- Desktop or enterprise application developers are more likely to participate multiple times a day.
Summary
The demographic analysis of the Stack Overflow Developer Survey dataset gives a broad view of how the global developer community engages with the platform and participates in the wider developer ecosystem. Exploring a range of metrics yielded insights into the demographic composition of Stack Overflow users and their patterns of interaction with the platform.
The analysis revealed that a significant proportion of developers across industries, employment statuses, education levels, developer types, age groups and years of coding experience are active users of Stack Overflow, which highlights the platform's widespread adoption among developers worldwide. Platform usage also varies across demographic groups: factors such as industry sector, employment status and education level influence developers' levels of participation in the Stack Overflow community.
Overall, the demographic analysis underscores the importance of Stack Overflow as a vital resource and community hub for developers, offering support, knowledge sharing and collaborative networks. By understanding the demographic characteristics and engagement patterns of developers on Stack Overflow, organizations and stakeholders in the tech industry can tailor their strategies and initiatives to better meet the needs of the developer community and foster a thriving ecosystem of innovation and collaboration in software development.